In this dissertation, we develop efficient algorithms for three problems that arise in
very large scale integrated computer-aided design (VLSI CAD): (1) transistor folding,
(2) module implementation selection, and (3) gate resizing.

Transistor folding reduces the area of row-based designs that employ transistors of
different sizes. Kim and Kang have developed an $O(m^2 \log m)$ algorithm to optimally
fold $m$ transistor pairs. In this dissertation we develop an $O(m^2)$ algorithm for
optimal transistor folding. Our experiments indicate that our algorithm runs 3 to 50
times as fast for $m$ values in the range [100, 100000].

We develop an $O(p \log n)$ time algorithm to obtain optimal solutions to the $p$-pin,
$n$-net single channel performance-driven implementation selection problem in
which each module has at most two possible implementations (2-PDMIS). Although
Her, Wang and Wong have also developed an $O(p \log n)$ algorithm for this problem,
experiments indicate that our algorithm is twice as fast on small circuits and up to
eleven times as fast on larger circuits. We also develop an $O(pn^{c-1})$ time algorithm
for the $c$-channel, $c > 1$, version of the 2-PDMIS problem.

We  study  the  problem  of  resizing  gates  to  reduce  overall  power  consumption  while 
satisfying  a  circuit's  timing  constraints.  Polynomial  time  algorithms  for  series-parallel 
and tree circuits are obtained. Gate resizing with multigate modules is shown to be
NP-hard. Algorithms that improve upon those developed by Chen and Sarrafzadeh
for  general  circuits  are  also  developed. 
CHAPTER  1
INTRODUCTION 


1.1     Background 

The design and fabrication of VLSI chips have been made possible by the automation
of several steps in the design process. The VLSI design process transforms a
formal specification into a fully packaged chip. It consists of the following steps [30]:

1.  System  specification:  In  this  step  a  high  level  representation  of  the  system 
is  created.  Performance,  functionality,  physical  dimensions,  choice  of  design 
techniques  and  fabrication  technology  are  considered  in  this  step. 

2.  Functional  design:  The  output  of  this  step  is  a  timing  diagram  which  is  obtained 
by  considering  the  behavioral  aspects  of  the  system. 

3. Logic design: The logic design that represents the functional design is obtained in
this step. In general, the logic design is represented by Boolean expressions. The
Boolean expressions are minimized to obtain the smallest logic design. Correctness
of the logic design is also verified in this step.

4. Circuit design: A circuit which represents the logic design of the system is developed
in this step by taking into consideration speed and power requirements,
and the electrical behavior of the components used in the development of the
circuit.


5.  Physical  design:  This  is  the  most  time  consuming  step  in  the  VLSI  design 
cycle.  In  this  step,  the  components  and  the  interconnections  are  represented  by 
geometric  patterns.  The  objective  of  this  step  is  to  obtain  an  arrangement  of 
these  geometric  patterns  which  minimizes  the  area  and  power  and  satisfies  the 
timing  requirements  of  the  chip.  Due  to  its  high  complexity  this  step  is  broken 
down  into  smaller  sub-steps.  We  will  look  into  this  step  in  detail  later  in  this 
chapter. 

6.  Design  verification:  In  this  step  design  rule  checking  and  circuit  extraction  are 
done  to  verify  that  the  circuit  layout  from  the  physical  design  step  satisfies  the 
system  specification  and  design  rules. 

7.  Fabrication:  The  verified  layout  is  used  in  the  fabrication  process  to  produce 
the  chip. 

8.  Packaging,  testing  and  debugging:  The  fabricated  chip  is  packaged  and  tested 
to  ensure  proper  functioning. 

1.2    Physical  Design  Automation 

Given a circuit description, the physical design process transforms the physical
description into a geometric description, called the layout, for fabrication. Because the
complexity of the physical design process is extremely large, Computer-Aided Design (CAD)
is used in almost all phases of the physical design process.

The  physical  design  process  is  divided  into  5  stages  [27]: 


1.  Partitioning:  A  large  circuit  is  decomposed  into  a  collection  of  smaller  blocks 
of  sub-circuits  or  modules,  taking  sizes  and  interconnections  between  blocks  as 
factors.  Partitioning  can  be  hierarchical  if  the  given  circuit  is  very  large. 

2.  Floorplanning  and  placement:  Logical  components  of  each  block  are  assigned 
an  approximate  location  in  floorplanning.  In  placement,  blocks  are  exactly 
positioned  on  a  chip  so  as  to  minimize  the  area  of  the  chip  and  so  that  the 
interconnections  between  blocks  can  be  completed. 

3.  Routing:  The  interconnections  between  blocks  are  completed  as  specified.  In 
global  routing,  connections  are  completed  between  the  proper  blocks  of  the 
circuit  disregarding  the  exact  geometric  details  of  each  wire  and  pin.  In  detailed 
routing,  each  connection  is  assigned  a  precise  geometric  position. 

4.  Compaction:  The  layout  is  compressed  in  all  directions  to  reduce  the  area. 

5.  Extraction  and  verification:  The  final  layout  is  verified  in  terms  of  functionality 
by  circuit  extraction.  Other  specific  requirements,  such  as  performance  and 
reliability,  are  also  verified  in  the  verification  process. 

1.3     Dissertation  Outline 

In this dissertation, we consider some of the problems that arise in the automation
of various stages of the VLSI design process. In Chapter 2, we consider folding
transistors to reduce the layout area of a row-based design. We develop an optimal
algorithm to fold transistors in a channel to minimize the layout area. Our algorithm
is both theoretically and practically faster than the algorithm proposed in Kim and
Kang [18].

In  Chapter  3,  we  consider  a  module  implementation  selection  algorithm  which 
minimizes  the  density  of  a  channel.  We  develop  an  optimal  algorithm  to  select  module 
implementations  along  a  channel  to  satisfy  the  net  span  constraints  of  each  net  and 
minimize  the  density  of  the  channel,  where  each  module  has  at  most  two  possible 
implementations.  The  algorithm  is  experimentally  compared  to  the  one  developed 
by Her et al. [11]. We also develop a polynomial-time algorithm for the multichannel
version of the problem.

In Chapter 4, we consider resizing gates to reduce the power consumption. We
develop fast optimal algorithms to resize gates in series-parallel circuits and trees to
minimize the power consumption subject to the timing constraint. We also prove
that gate resizing with multigate modules is NP-hard. We develop fast algorithms
to perform gate resizing on general circuits. Experimental results comparing our
algorithm with that of Chen and Sarrafzadeh [3] are also presented.

In  Chapter  5,  we  present  conclusions  and  some  future  directions  for  this  research. 


CHAPTER  2 
TRANSISTOR  FOLDING 


2.1     Introduction 


In high-performance circuit design, the transistor sizing problem has been investigated
widely (for example, [26, 7, 28, 4]). The objective of transistor sizing is
to reduce the circuit delay by increasing the area of transistors. One by-product of
transistor sizing is the generation of layouts with transistors of widely varying size. In
row-based layout synthesis ([17, 29, 32, 34]), we group pMOS and nMOS transistors
together and place them in rows. Layout area in these designs is wasted because of
nonuniform cell heights. The layout area required can be reduced by folding large
transistors so that their height is reduced. Transistor folding to optimize layout
area has been considered by Kim and Kang [18] and Her and Wong [12]. Her and
Wong [12] have developed an $O(m^6)$ dynamic programming algorithm for the general
transistor folding problem. (If only $s$ heights are possible for the folded transistors,
the complexity of Her and Wong's algorithm is $O(m^3 s^3)$. In general, $s$ is $O(m)$.) Kim
and Kang [18] have developed a more practical algorithm for the case of row-based
designs. The complexity of their algorithm is $O(m^2 \log m)$ or $O(s(m + s) \log m)$.
They also show that the area of row-based designs can be reduced by as much as
30% by performing transistor folding. In this paper, we consider the row-based-design
transistor-folding problem considered in reference [18] and develop an $O(m^2)$
or $O(s(m + s))$ algorithm to minimize area. We also report on experiments
that show that our algorithm actually runs much faster than the algorithm of
Kim and Kang [18]. The test circuits used in our experiments have between 100 and
100,000 transistor pairs. So, our tests are similar to those conducted by Kim and
Kang [18], where the circuits had from 192 to 88,258 transistor pairs.

2.2    Problem  Formulation 

We are given a CMOS circuit with a row of $m$ transistor pairs. Each transistor
pair consists of a pMOS transistor and its dual nMOS transistor. Let $p_i$ and $n_i$,
respectively, be the heights of the pMOS and nMOS transistors in the $i$th pair,
$1 \le i \le m$. $p_i$ and $n_i$ are integers that give transistor height in multiples of the
minimum resolution $\lambda$. Figure 2.1 shows a CMOS circuit with 4 pairs of transistors,
in which $p_2 = 10$ and $n_2 = 12$. If the folding height of pMOS transistors is 4 and that of
nMOS transistors is 3, then the circuit layout is as in Figure 2.2. The second pMOS
transistor is divided into three columns of height 4, 4, and 2, respectively, and the
second nMOS transistor is divided into four columns of height 3 each. The area
occupied by the folded transistor pair is shown by a shaded box in Figure 2.2. In
practice, the height of the layout area is slightly larger than the sum of the pMOS
and nMOS folding heights, and the layout width is slightly larger than the number
of transistor columns because of overheads.

Figure 2.1. An example circuit with 4 pairs of transistors

Figure 2.2. The circuit of Figure 2.1 after folding with $h_p = 4$ and $h_n = 3$


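Let $h_p$ and $h_n$ be the folded heights of the pMOS and nMOS transistors, respectively.
The width of the folded layout is $\sum_{i=1}^{m} \max(\lceil p_i/h_p \rceil, \lceil n_i/h_n \rceil) + c_h$ and the height is
$h_p + h_n + c_v$, where $c_v$ and $c_h$ are, respectively, vertical and horizontal overheads. The
area of the folded layout is [18]

$$A = (h_p + h_n + c_v)\left(\sum_{i=1}^{m} \max\left(\left\lceil \frac{p_i}{h_p} \right\rceil, \left\lceil \frac{n_i}{h_n} \right\rceil\right) + c_h\right) \qquad (2.1)$$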

In practice, there is a technological constraint on how small $h_p$ and $h_n$ can be. It
is required [18] that $h_p \ge PMIN$ and $h_n \ge NMIN$.
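
As an illustration of Equation 2.1, the following Python sketch evaluates the folded layout area for given folding heights. The names `ceil_div` and `folded_area`, and the default overheads of zero, are assumptions made for this example; they do not come from [18].

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers, without floating point."""
    return -(-a // b)


def folded_area(p, n, hp, hn, cv=0, ch=0):
    """Area of the folded layout as given by Equation 2.1.

    p, n   : lists of pMOS / nMOS transistor heights (same length m)
    hp, hn : folding heights of the pMOS / nMOS transistors
    cv, ch : vertical / horizontal overheads (zero by default, an assumption)
    """
    # Each pair needs as many columns as its more demanding transistor.
    columns = sum(max(ceil_div(pi, hp), ceil_div(ni, hn)) for pi, ni in zip(p, n))
    return (hp + hn + cv) * (columns + ch)
```

For the second transistor pair of Figure 2.1 taken alone ($p_2 = 10$, $n_2 = 12$) with $h_p = 4$, $h_n = 3$ and zero overheads, the pair needs $\max(3, 4) = 4$ columns, so `folded_area([10], [12], 4, 3)` evaluates to $(4 + 3) \cdot 4 = 28$.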

Kim and Kang [18] give two algorithms to determine $h_p$ and $h_n$ so that the layout
area is minimized. The first algorithm is an exhaustive search algorithm that simply
tries out all integer choices for $h_p$ and $h_n$ such that $PMIN \le h_p \le \max_{1 \le i \le m}\{p_i\} = \max(P)$
and $NMIN \le h_n \le \max_{1 \le i \le m}\{n_i\} = \max(N)$. (Here $P = \{p_1, p_2, \ldots, p_m\}$
and $N = \{n_1, n_2, \ldots, n_m\}$.) The complexity of the exhaustive search algorithm is
$O(\max(P) \cdot \max(N) \cdot m) = O(m^3)$ because $\max(P)$ and $\max(N)$ are $O(m)$ for practical
circuits [18].
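
The exhaustive search can be sketched in a few lines of Python. This is only an illustration of the search described above, not the code of [18]; the function name `exhaustive_fold`, the tuple it returns, and the default overheads of zero are assumptions. It also assumes $PMIN \le \max(P)$ and $NMIN \le \max(N)$.

```python
def exhaustive_fold(p, n, pmin, nmin, cv=0, ch=0):
    """Try every integer (hp, hn) in the allowed ranges and keep the best.

    Returns (area, hp, hn) for a minimum-area choice, or None if a range is empty.
    The triple loop costs O(max(P) * max(N) * m) time.
    """
    best = None
    for hp in range(pmin, max(p) + 1):
        for hn in range(nmin, max(n) + 1):
            # Equation 2.1: width is the total column count plus overhead.
            columns = sum(max(-(-pi // hp), -(-ni // hn)) for pi, ni in zip(p, n))
            area = (hp + hn + cv) * (columns + ch)
            if best is None or area < best[0]:
                best = (area, hp, hn)
    return best
```

With a single pair $p = 10$, $n = 12$ and no overheads, no choice can do better than area $10 + 12 = 22$, which is reached, for example, at $h_p = 5$, $h_n = 6$.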

The second algorithm [18] works in two phases. In the first phase, the algorithm
constructs a subset $S_P$ of $[PMIN, \max(P)]$ and another subset $S_N$ of
$[NMIN, \max(N)]$ with the property that the optimal $h_p$ is in $S_P$ and the optimal $h_n$
is in $S_N$. The basic observation used to arrive at $S_P$ and $S_N$ is that if the heights $h_1$
and $h_1 + k$ divide a transistor into the same number of columns, then $h_1$ is preferred
over $h_1 + k$ (for example, if $p_i = 14$, then folding heights 7, 8, 9, 10, 11, 12 and 13
all fold the transistor into two columns; 7 is preferred over the remaining choices).
In the second phase, the optimal combination $(h_p, h_n)$ is determined from $S_P$ and
$S_N$. The complexity of the second phase is $O(s(m + s) \log m) = O(m^2 \log m)$, where
$s = |S_P| + |S_N|$, and that of the first phase is $\Theta(\sum_{i=1}^{m} (p_i + n_i)) = O(m^2)$ (assuming
$\max(P)$ and $\max(N)$ are $O(m)$).
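
The observation about preferred heights can be checked with a small Python sketch for a single transistor. The helper names `column_count` and `preferred_heights` are hypothetical; the sketch simply lists, for each possible column count $j$, the smallest height $\lceil size/j \rceil$ that achieves it.

```python
def column_count(size: int, h: int) -> int:
    """Number of columns when a transistor of the given size is folded at height h."""
    return -(-size // h)


def preferred_heights(size: int):
    """Smallest folding height for every achievable column count of one transistor."""
    return sorted({-(-size // j) for j in range(1, size + 1)})


# Heights 7..13 all fold a transistor of size 14 into two columns,
# so only the smallest of them (7) needs to be kept as a candidate.
two_column_heights = [h for h in range(1, 15) if column_count(14, h) == 2]
```

For size 14, `two_column_heights` is `[7, 8, 9, 10, 11, 12, 13]` and `preferred_heights(14)` is `[1, 2, 3, 4, 5, 7, 14]`.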

2.3     Our  Algorithm 

2.3.1     Phase  I 

Our algorithm is also a two phase algorithm. The first phase of our algorithm is
identical to the first phase of Kim and Kang's algorithm [18]. We compute the subsets
$S_P$ and $S_N$ using the code of Figure 2.3. The arrays SPL and SNL are initialized to
zero in the first two for loops. Then we determine the members of $S_P$ and $S_N$; we
set SPL[i] = 1 if and only if $i \in S_P$ and SNL[i] = 1 if and only if $i \in S_N$. Finally,
$S_P$ and $S_N$ are computed in compact form from SPL and SNL, respectively. Note
that we can compute $S_P$ and $S_N$ in either ascending or descending order easily by
controlling the direction of traversal of the SPL and SNL arrays, respectively, in the
last two for loops. The algorithm presented in Figure 2.3 computes $S_P$ in ascending
order and $S_N$ in descending order.

Algorithm Phase I (P, N, PMIN, NMIN)
/* Compute S_P and S_N */

/* Initialize SPL[] and SNL[] */
for i = PMIN to max(P) do
    SPL[i] ← 0;
for i = NMIN to max(N) do
    SNL[i] ← 0;

/* set SPL[] and SNL[] */
for i = 1 to m do
    for j = 1 to p_i do
        SPL[⌈p_i/j⌉] ← 1;
    for j = 1 to n_i do
        SNL[⌈n_i/j⌉] ← 1;
end for

/* collect items from SPL[] and SNL[] and store them into S_P[] and S_N[] */
SPsize ← 0; SNsize ← 0;
for i = PMIN to max(P) do
    if SPL[i] = 1 then
        S_P[SPsize++] ← i;
for i = max(N) downto NMIN do
    if SNL[i] = 1 then
        S_N[SNsize++] ← i;

Figure 2.3. Computing $S_P$ and $S_N$
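
The first phase can be sketched in Python as follows. The sketch follows the marking idea of Figure 2.3, but the function name `phase1` and the nested helper are illustrative. One detail is an assumption of this sketch rather than something stated in the text: candidate heights below the technological minimum are clamped to that minimum, so that PMIN (or NMIN) itself is always available as a candidate. The sketch also assumes $PMIN \le \max(P)$ and $NMIN \le \max(N)$.

```python
def phase1(p, n, pmin, nmin):
    """Return (S_P ascending, S_N descending) candidate folding heights."""

    def candidates(sizes, hmin):
        hmax = max(sizes)
        flag = [False] * (hmax + 1)           # plays the role of SPL[] / SNL[]
        for size in sizes:
            for j in range(1, size + 1):      # j = number of columns
                h = -(-size // j)             # smallest height giving j columns
                flag[max(h, hmin)] = True     # assumption: clamp to the minimum height
        return [h for h in range(hmin, hmax + 1) if flag[h]]

    sp = candidates(p, pmin)                  # ascending order
    sn = candidates(n, nmin)[::-1]            # descending order
    return sp, sn
```

For example, `phase1([14], [6], 1, 1)` returns `([1, 2, 3, 4, 5, 7, 14], [6, 3, 2, 1])`. The marking loops perform $\sum_i (p_i + n_i)$ iterations in total, which matches the $\Theta(\sum_{i=1}^{m}(p_i + n_i))$ bound quoted for the first phase.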

2.3.2     Phase  II

Assume that the transistor pairs have been reordered so that $p_i/n_i \le p_{i+1}/n_{i+1}$; also
assume that $p_0/n_0 = 0$ and $p_{m+1}/n_{m+1} = \infty$. The formula, Equation 2.1, for the layout area
can be rewritten as

$$A = (h_p + h_n + c_v)\left(\sum_{i=1}^{k-1} \left\lceil \frac{n_i}{h_n} \right\rceil + \sum_{i=k}^{m} \left\lceil \frac{p_i}{h_p} \right\rceil + c_h\right) \qquad (2.2)$$

where $k \in [1, m + 1]$ is such that

$$\frac{p_{k-1}}{n_{k-1}} < \frac{h_p}{h_n} \le \frac{p_k}{n_k} \qquad (2.3)$$

Let $L_N(h_n, k) = \sum_{i=1}^{k-1} \lceil n_i/h_n \rceil$ and $L_P(h_p, k) = \sum_{i=k}^{m} \lceil p_i/h_p \rceil$. We can rewrite Equation
2.2 as

$$A = (h_p + h_n + c_v)(L_N(h_n, k) + L_P(h_p, k) + c_h) \qquad (2.4)$$

From Equation 2.3 and the ordering of the transistors by $p_i/n_i$, it follows that if $h_n$ is
held fixed and $h_p$ increased, the value of $k$ cannot decrease. This observation results
in the algorithm of Figure 2.4.

Since $S_P$ and $S_N$ can be computed in ascending and descending order, respectively,
by Algorithm Phase I of Figure 2.3, no sorting is needed to evaluate the members
of $S_P$ and $S_N$ in the specified order. We can sort the transistors into increasing
(actually nondecreasing) order of $p_i/n_i$ in $O(m \log m)$ time, and the arrays $L_N$ and
$L_P$ can be computed in $\Theta(m|S_N|)$ and $\Theta(m|S_P|)$ time, respectively. Each iteration
of the outer for loop takes $O(|S_P| + m)$ time. Therefore, the time needed
for all $|S_N|$ iterations is $O(|S_N|(|S_P| + m))$. We can change this complexity to
$O(\min\{|S_N|, |S_P|\}(\max\{|S_N|, |S_P|\} + m))$ by interchanging the inner and outer for
loop headers.

Algorithm Phase II (P, N, S_P, S_N, c_h, c_v)
/* S_P is in ascending order and S_N is in descending order */
Sort P and N in increasing P[i]/N[i] ratio;
Compute L_N[][] and L_P[][];
for each h_n ∈ S_N do
    k ← 1;
    for each h_p ∈ S_P do
        while P[k]/N[k] < h_p/h_n do
            k ← k + 1;
        A = min(A, (h_p + h_n + c_v) * (L_N[h_n][k] + L_P[h_p][k] + c_h));
    end for
end for

Figure 2.4. Compute optimal $h_p$ and $h_n$
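
A minimal Python sketch of the second phase, under stated assumptions, is given below. It follows the structure of Figure 2.4 with 0-based indices; the function name `phase2`, the dictionaries `LN` and `LP`, the returned tuple, and the default overheads of zero are choices made for this illustration. Ratios are compared by cross-multiplication and sorted with `fractions.Fraction` to avoid floating-point error, which is a design choice of the sketch and not something taken from the text. All $n_i$ are assumed positive.

```python
from fractions import Fraction


def phase2(p, n, sp, sn, cv=0, ch=0):
    """sp must be ascending and sn descending. Returns (area, hp, hn)."""
    pairs = sorted(zip(p, n), key=lambda t: Fraction(t[0], t[1]))
    m = len(pairs)

    # LN[hn][k]: columns needed by the nMOS transistors of pairs 0 .. k-1
    LN = {}
    for hn in sn:
        pre = [0] * (m + 1)
        for i, (_, ni) in enumerate(pairs):
            pre[i + 1] = pre[i] + (-(-ni // hn))
        LN[hn] = pre

    # LP[hp][k]: columns needed by the pMOS transistors of pairs k .. m-1
    LP = {}
    for hp in sp:
        suf = [0] * (m + 1)
        for i in range(m - 1, -1, -1):
            suf[i] = suf[i + 1] + (-(-pairs[i][0] // hp))
        LP[hp] = suf

    best = None
    for hn in sn:
        k = 0
        for hp in sp:
            # Skip pairs with p/n < hp/hn; for them the nMOS term is the maximum.
            while k < m and pairs[k][0] * hn < hp * pairs[k][1]:
                k += 1
            area = (hp + hn + cv) * (LN[hn][k] + LP[hp][k] + ch)
            if best is None or area < best[0]:
                best = (area, hp, hn)
    return best
```

Because $h_p$ only increases within one pass of the inner loop, the index `k` never moves backwards, so each pass costs $O(|S_P| + m)$ time, as stated above.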

Further improvement in run time is possible. Consider the algorithm of Figure 2.4.
Let $k_i$ be the $k$ value that satisfies Equation 2.3 when we use the first (i.e., largest)
$h_n$ value $h_{n_1}$ and the $i$th (i.e., $i$th smallest) $h_p$ value $h_{p_i}$. On the next iteration of the
outer for loop, $h_{n_2} < h_{n_1}$, so $h_{p_i}/h_{n_1} < h_{p_i}/h_{n_2}$, and the $k$ value that satisfies Equation 2.3
is at least $k_i$. Hence, if we save $k_i$ from the first iteration, we can start the search for
the new $k$ value at $k_i$. This observation leads to the refinement shown in Figure 2.5.
Although its worst-case complexity is the same as that of Figure 2.4, it is expected
to run faster in practice.

Algorithm Refined Phase II (P, N, S_P, S_N, c_h, c_v)
/* S_P is in ascending order and S_N is in descending order */
Sort P and N in increasing P[i]/N[i] ratio;
Compute L_N[][] and L_P[][];
Initialize K_hp to 0 for all h_p ∈ S_P
if |S_N| ≤ |S_P| then
    for each h_n ∈ S_N do
        k ← 1;
        for each h_p ∈ S_P do
            k ← max(k, K_hp);
            while P[k]/N[k] < h_p/h_n do
                k ← k + 1;
            K_hp ← k;
            A = min(A, (h_p + h_n + c_v) * (L_N[h_n][k] + L_P[h_p][k] + c_h));
        end for
    end for
else
    /* same as "if", but interchange the inner and outer for loop headers, and replace
       K_hp by K_hn */
end if

Figure 2.5. Refined Phase 2 algorithm

2.4     Experimental  Results

The phase 1 algorithm of Figure 2.3, the phase 2 algorithm of Figure 2.5, and
the two algorithms of Kim and Kang [18] were implemented by us in C and run on
a SUN SPARCstation 4. Similar programming methodologies were used to develop
the codes for our algorithm and that of Kim and Kang [18]. As a result, we expect
that almost all of the performance difference exhibited in our experiments is due to
algorithmic rather than programming differences. Since we were unable to obtain
the test data used by Kim and Kang [18], we generated random data. We ignore
any possible correlation between pMOS and nMOS transistors. For our test data,
the number of transistor pairs ranged from 100 to 100,000. This covers the range
in transistor numbers (192 to 88,258) in the circuits of Kim and Kang [18]. For
our first test set, the sizes of the pMOS and nMOS transistors were generated using
a uniform random number generator with range [30, 90] for pMOS and [20, 60] for
nMOS. These size ranges correspond to those for the circuit fract that was used by
Kim and Kang [18]; the circuit fract has 598 transistors. Since all three algorithms
generate optimal solutions, run time is the only comparative factor. This time is
provided in Table 2.1. The exhaustive search algorithm was not run for $m > 10{,}000$
as its run time becomes prohibitive. In the case of the algorithm proposed by Kim
and Kang [18], the phase 2 time is significantly larger than the phase 1 time. Our
algorithm for phase 2 has brought this time down to approximately the phase 1 time.
For small circuits ($m \le 10{,}000$), our phase 2 algorithm is 6 to 10 times as fast as
the phase 2 algorithm of Kim and Kang [18] and provides an overall speedup of 3.5
to 5.8 for the entire area minimization process (phase 1 plus phase 2). On larger
circuits, the speedup is more dramatic. For instance, when $m = 100{,}000$, our phase
2 algorithm is almost 50 times as fast as that of Kim and Kang [18] and provides an
overall speedup of almost 28.
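
The two speedup figures quoted here can be reproduced from the Table 2.1 timings. The Python sketch below assumes, as the phrase "phase 1 plus phase 2" suggests, that the overall speedup is the ratio of total run times, with the same phase 1 time on both sides; the function names are illustrative.

```python
def phase2_speedup(kk_phase2: float, our_phase2: float) -> float:
    """Ratio of the phase 2 times (Kim and Kang's over ours)."""
    return kk_phase2 / our_phase2


def overall_speedup(phase1: float, kk_phase2: float, our_phase2: float) -> float:
    """Ratio of total times, assuming both methods share the same phase 1 time."""
    return (phase1 + kk_phase2) / (phase1 + our_phase2)


# Table 2.1 row for m = 100,000: phase 1 = 27.24 s, phase 2 = 1716.02 s vs 35.29 s
s2 = phase2_speedup(1716.02, 35.29)
so = overall_speedup(27.24, 1716.02, 35.29)
```

Here `s2` is about 48.63 and `so` is about 27.88, in agreement with the tabulated values. For the smallest circuits the timings are only a few hundredths of a second, so speedups recomputed from the rounded table entries can differ slightly from the tabulated ones.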

We experimented with two other data sets. Table 2.2 reports the run times for
circuits in which the range of the uniform random number generator was set to
[30, 180] for pMOS transistor sizes and [20, 120] for nMOS sizes, and Table 2.3 gives
the run times when the transistor sizes are from a normal distribution with mean 40
and standard deviation 10 for pMOS transistors and mean 30 and standard deviation
10 for nMOS transistors. The overall speedups range from a low of 3.95 to a high of
48.02.

Table 2.1. Run time and speedup using a uniform distribution

m       Exhaustive  Phase 1  Phase 2 (Kim & Kang)  Phase 2 (ours)  Phase 2 speedup  Overall speedup
100           1.46     0.03                  0.30            0.03            10.00             5.55
300           4.41     0.08                  0.60            0.09             6.67             4.00
500           7.34     0.14                  0.89            0.14             6.36             3.62
600           8.79     0.16                  1.05            0.17             6.18             3.63
1000         14.67     0.28                  1.69            0.27             6.26             3.56
5000         74.59     1.38                 11.43            1.45             7.88             4.53
10000       149.12     2.75                 30.71            3.01            10.20             5.81
50000            -    13.64                458.24           17.51            26.17            15.15
100000           -    27.24               1716.02           35.29            48.63            27.88

Time in seconds

Table 2.2. Run time and speedup using a uniform distribution with larger limits

m       Exhaustive  Phase 1  Phase 2 (Kim & Kang)  Phase 2 (ours)  Phase 2 speedup  Overall speedup
100           6.35     0.05                  0.97            0.06            16.17             9.67
300          19.87     0.16                  2.18            0.20            10.90             6.58
500          33.15     0.27                  2.94            0.33             8.91             5.39
600          39.77     0.32                  3.38            0.40             8.45             5.16
1000         66.31     0.53                  4.73            0.65             7.28             4.47
5000        336.92     2.60                 21.82            3.31             6.59             4.13
10000       673.43     5.23                 49.25            6.89             7.15             4.50
50000            -    26.09                485.10           38.87            12.48             7.87
100000           -    52.12               3710.35           85.40            43.45            27.36

Time in seconds

Table 2.3. Run time and speedup using a normal distribution

m       Exhaustive  Phase 1  Phase 2 (Kim & Kang)  Phase 2 (ours)  Phase 2 speedup  Overall speedup
100           0.90     0.02                  0.20            0.02            10.00             5.30
300           3.22     0.07                  0.47            0.06             7.83             4.24
500           5.79     0.10                  0.75            0.11             6.82             3.98
600           6.70     0.12                  0.88            0.13             6.77             3.95
1000         12.69     0.20                  1.48            0.23             6.43             3.96
5000         68.26     1.03                 12.92            1.38             9.36             5.79
10000       129.67     1.99                 36.91            2.80            13.18             8.12
50000            -    10.05                679.50           18.05            37.65            24.54
100000           -    20.04               2676.08           36.10            74.13            48.02

Time in seconds

2.5    Conclusion

We have developed a transistor folding algorithm that is both theoretically and
practically faster than the algorithm proposed by Kim and Kang [18]. Our algorithm
is also simpler to code. Experiments suggest that our algorithm runs 3 to 50 times as
fast as Kim and Kang's algorithm [18] on circuits with 100 to 100,000 transistor
pairs. These circuit sizes are comparable to theirs.


CHAPTER  3 
PERFORMANCE  DRIVEN  MODULE  IMPLEMENTATION  SELECTION 


3.1     Introduction 

In the channel routing problem, we have a routing channel with modules on the
top and bottom of the channel; the modules have pins, and subsets of pins define
nets. The objective is to route the nets while minimizing channel height. Several
algorithms have been proposed for channel routing [35].

When  the  modules  on  either  side  of  the  channel  are  programmable  logic  arrays, 
we  have  the  flexibility  of  reordering  the  pins  in  each  module;  any  pin  permutation 
may  be  used.  The  ability  to  reorder  module  pins  adds  a  new  dimension  to  the 
routing  problem.  Channel  routing  with  rearrangeable  pins  was  studied  by  Kobayashi 
and  Drozd  [19].  They  proposed  a  three  step  algorithm:  (1)  permute  pins  so  as  to 
maximize  the  number  of  aligned  pin  pairs  (a  pair  of  pins  on  different  sides  of  the 
channel  is  aligned  iff  they  occupy  the  same  horizontal  location  and  they  are  pins  of 
the  same  net),  (2)  permute  the  nonaligned  pins  so  as  to  remove  cyclic  constraints, 
and  (3)  while  maintaining  an  acyclic  vertical  constraint  graph,  permute  unaligned 
pins  so  as  to  minimize  channel  density.  Lin  and  Sahni  [21]  developed  a  linear  time 
algorithm  for  step  (1),  and  Sahni  and  Wu  [25]  showed  that  steps  (2)  and  (3)  are 
NP-hard.  Tragoudas  and  Tollis  [31]  present  a  linear  time  algorithm  to  determine 
whether  there  is  a  pin  permutation  for  which  a  channel  is  river  routable.  They  also 
showed that the problem of determining a pin permutation that results in minimum
density  is  NP-hard  in  general,  and  they  developed  polynomial  time  algorithms  for 
the  special  case  of  channels  with  two  terminal  nets  and  channels  with  at  most  one 
terminal  of  each  net  being  in  each  module. 

Variants  of  the  channel  routing  with  permutable  pins  problem  have  also  been 
studied  [14,  2,  16,  13].  In  these  variants  restrictions  are  placed  on  the  allowable 
pin  permutations  for  each  module.  Restrictions  may  arise,  for  example,  because  the 
module  library  contains  only  a  limited  set  of  implementations  of  each  module  [14]. 
Another variant, considered by Cai and Wong [2], permits the shifting of modules and
pins to minimize channel density. Extensions to the case when over-the-cell routing
is permitted are considered in [16, 13].

The variant of the channel routing with permutable pins problem that we consider
in this paper is the performance-driven module implementation selection (PDMIS)
problem formulated by Her et al. [11]. In the $k$-PDMIS problem, we are given two
rows of modules with a routing channel in between, up to $k$ possible implementations
for each module (different implementations of a module differ only in the location of
pins; the module size and pin count are the same), and a set of net span constraints
(the span of a net is the distance between its leftmost and rightmost pins). A feasible
solution to a $k$-PDMIS instance is a selection of module implementations so that all
net span constraints are satisfied. An optimal solution is a feasible solution with
minimum channel density.

Figure 3.1(a) shows a routing channel with two modules on either side of the
routing  channel.  Assume  that  each  module  has  two  implementations  and  that  the 
pin  locations  for  the  second  implementation  of  each  module  are  as  in  Figure  3.1(b). 
The  net  span  constraints  of  the  five  nets  are  4,  4,  1,  1  and  6,  respectively.  This  defines 
an  instance  of  the  2-PDMIS  problem.  Using  the  implementations  of  Figure  3.1(a),  the 
net  spans  are  5,  3,  1,  1  and  6,  respectively.  The  span  constraint  of  net  1  is  violated. 
If  each  module  is  implemented  as  in  Figure  3.1(b),  the  net  spans  are  1,  5,  1,  1  and 
4,  respectively.  This  time,  the  span  constraint  of  net  2  is  violated.  If  we  implement 
the  modules  as  in  Figure  3.1(c)  (i.e.,  for  modules  1  and  2  use  the  implementations 
of  Figure  3.1(a)  and  for  modules  3  and  4,  use  the  implementations  of  Figure  3.1(b)), 
the  net  spans  are  4,  4,  1,  1  and  6,  respectively.  Now,  the  net  span  constraints  are 
satisfied  for  all  nets.  The  channel  density,  when  module  implementations  are  selected 
as  in  Figure  3.1(c),  is  5.  Selecting  module  implementations  as  in  Figure  3.1(d),  we 
obtain  a  feasible  solution  whose  density  is  3. 
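
The quantities used in this example (net span, span feasibility and channel density) can be sketched in Python as follows. The pin coordinates below are hypothetical, since Figure 3.1 is not reproduced here, and the function names are illustrative. The sketch takes the density to be the largest number of nets whose horizontal extent covers a single column, which is one common definition and is an assumption of this example.

```python
def net_spans(nets):
    """Span of each net: distance between its leftmost and rightmost pins."""
    return {name: max(xs) - min(xs) for name, xs in nets.items()}


def satisfies_span_constraints(nets, limits):
    """True iff every net's span is within its span constraint."""
    spans = net_spans(nets)
    return all(spans[name] <= limits[name] for name in nets)


def channel_density(nets):
    """Largest number of nets whose interval [leftmost, rightmost] covers one column."""
    intervals = [(min(xs), max(xs)) for xs in nets.values()]
    first = min(lo for lo, _ in intervals)
    last = max(hi for _, hi in intervals)
    return max(sum(lo <= x <= hi for lo, hi in intervals) for x in range(first, last + 1))


# Hypothetical pin columns for four nets (not the data of Figure 3.1).
example_nets = {1: [0, 4], 2: [2, 5], 3: [3, 3], 4: [6, 7]}
example_limits = {1: 4, 2: 4, 3: 1, 4: 1}
```

For this hypothetical instance the spans are 4, 3, 0 and 1, all constraints are met, and the density is 3 (nets 1, 2 and 3 all cover column 3).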

Her et al. [11] show that the $k$-PDMIS problem is NP-hard for every $k \ge 3$.
For the 2-PDMIS problem, they develop an $O(p \log n)$ algorithm to find an optimal
solution. In this paper, we develop an alternative $O(p \log n)$ algorithm to find an
optimal solution to the 2-PDMIS problem. Experiments indicate that our algorithm
is twice as fast on small circuits and up to eleven times as fast on larger circuits.

We begin, in Section 3.2, by providing an overview of the $O(p \log n)$ algorithm [11].
Then, in Section 3.3, we describe our $O(p \log n)$ algorithm. In Section 3.4, we develop
an $O(pn^{c-1})$ algorithm for the $c$-channel, $c > 1$, 2-PDMIS problem. Experimental
results using the single channel 2-PDMIS algorithm are presented in Section 3.5.

Figure 3.1. An example PDMIS problem: (a) first implementation; (b) second
implementation; (c) selections that satisfy the net span constraints; (d) selection with
better density

3.2    $O(p \log n)$  Algorithm  of  Her  et  al.

Her et al. [11] show how to transform an instance $P$ of 2-PDMIS with net span
constraints and a constraint, $d$, on channel density into an instance $S$ of the 2-SAT
problem (each instance of the 2-SAT problem is a conjunctive normal form formula
in which each clause has at most two literals). The 2-SAT instance $S$ is satisfiable iff
the corresponding 2-PDMIS instance has a feasible solution with channel density $\le d$.
The size of the constructed 2-SAT formula $S$ is $O(p)$, where $p$ is the total number of
pins in the modules of $P$. Since the channel density of the optimal solution is in the
range $[1, n]$, where $n$ is the total number of nets, a binary search over $d$ can be used
to obtain an optimal solution in $O(p \log n)$ time.
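
The binary search over $d$ can be sketched as below. The callable `feasible` is a hypothetical stand-in for the 2-SAT test described in this section, and the sketch assumes monotonicity (if density $d$ is achievable, so is any larger bound), which holds because a selection with density at most $d$ also has density at most $d + 1$.

```python
def min_feasible_density(n, feasible):
    """Smallest d in [1, n] for which feasible(d) is True, or None if there is none.

    feasible(d) is assumed monotone: once True, it stays True for larger d.
    The predicate is evaluated O(log n) times.
    """
    if n < 1 or not feasible(n):
        return None            # even the loosest density bound cannot be met
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid           # mid works; look for something smaller
        else:
            lo = mid + 1       # mid fails; the answer is larger
    return lo
```

With each feasibility test costing $O(p)$ time, the $O(\log n)$ tests give the $O(p \log n)$ total stated above.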

Her et al. [11] use one boolean variable to represent each module. The interpretation
is that variable $x_i$ is true iff implementation 1 of module $i$ is selected. The steps
in the 2-PDMIS algorithm [11] are

1. Construct the 2-SAT formula $C_{span}$ such that $C_{span}$ is satisfiable iff the given
2-PDMIS instance has a feasible solution. This is done by constructing a 2-SAT
formula for each net and then taking the conjunction of these formulae. For
each net $j$, the leftmost and rightmost modules on the top row and bottom
row are identified. These (at most four) modules are the critical modules for
net $j$, as the span of net $j$ is determined solely by these modules. A 2-SAT
formula involving the boolean variables that represent these critical modules is
constructed. This 2-SAT formula has the property that truth value assignments
satisfy the 2-SAT formula iff the corresponding module implementations cause
the net span constraint for net $j$ to be satisfied.

2. Construct a 2-SAT formula Cden using a density constraint d. Cden is satisfiable
only by module implementation selections which result in a channel density
that is ≤ d. To construct Cden, partition the channel into a minimum number
of regions such that no region contains a module boundary in its interior; for
each region, construct a 2-SAT formula so that the density in the region is
≤ d whenever the 2-SAT formula is true (this 2-SAT formula involves only the
module in the top row of the region and the one in the bottom row); take the
conjunction of the region 2-SAT formulae.

3. Determine if the 2-SAT formula Cspan ∧ Cden is satisfiable by using the strongly
connected components method described in Papadimitriou and Steiglitz [22].
This requires that we first construct a directed graph from Cspan ∧ Cden.

4. Repeat steps 2 and 3, performing a binary search for the minimum value of d
for which Cspan ∧ Cden is satisfiable.

As shown in Her et al. [11], the size of Cspan ∧ Cden is O(p); step 3 takes O(p)
time; and the overall complexity is O(p log n).
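
The satisfiability test of step 3 can be illustrated with a generic textbook 2-SAT solver. The following Python sketch is an assumption-laden illustration, not the construction of [11] or [22]: the literal encoding (+k for x_k, -k for its negation), the function name two_sat and the use of Kosaraju's two-pass method are choices made here for brevity.

```python
def two_sat(num_vars, clauses):
    """Return a satisfying assignment (list of bools) or None.

    clauses is a list of pairs of nonzero integers; +k stands for x_k and
    -k for (not x_k), with 1 <= k <= num_vars. Encoding is illustrative.
    """
    def node(lit):
        # Graph node for a literal: even = positive, odd = negated.
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for u, v in ((node(-a), node(b)), (node(-b), node(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    # Pass 1: iterative DFS on the implication graph, recording finish order.
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, k = stack.pop()
            if k < len(graph[u]):
                stack.append((u, k + 1))
                v = graph[u][k]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    # Components are numbered in topological order of the condensation.
    comp = [-1] * n
    count = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = count
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rgraph[u]:
                if comp[v] == -1:
                    comp[v] = count
                    stack.append(v)
        count += 1

    assignment = []
    for i in range(num_vars):
        if comp[2 * i] == comp[2 * i + 1]:
            return None  # x_i and (not x_i) are equivalent: unsatisfiable
        # Set x_i true when its component comes later in topological order.
        assignment.append(comp[2 * i] > comp[2 * i + 1])
    return assignment
```

Both passes visit each node and edge a constant number of times, so the test is linear in the size of the formula, in line with the O(p) bound quoted for step 3.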

3.3    Our O(p log n) Algorithm

Our algorithm is a two stage algorithm that does not construct a 2-SAT formula.
In the first stage, we construct a set of 2m "forcing lists", where m is the number of
modules. L[i] is a list of module implementation selections that get forced if the first
implementation of module i, 1 ≤ i ≤ m, is selected; L[m + i] is the corresponding
list for module i when the second implementation of module i is selected. By forced,
we mean that unless the module implementations on L[i] (L[m + i]) are selected
whenever the first (second) implementation of module i is selected, we cannot have a
feasible solution that also satisfies the given density constraint. In the second stage,
we use the limited branching method [6] and the forcing lists constructed in stage 1
to obtain a module implementation selection that satisfies the net span and density
constraints (provided such a selection is possible). To find an optimal solution, we
use binary search to determine the smallest density constraint for which a feasible
solution exists.

3.3.1     Stage  1 

In stage 1, we construct the forcing lists L[1..2m]. If the selection of implementation
1 of module i requires that we select implementation 1 of module j, we place j on
the list L[i]; if the selection of implementation 1 of module i requires that we select
implementation 2 of module j, we place m + j on L[i]. Similarly, when the selection
of implementation 2 of module i requires a particular implementation be selected for
module j, we place either j or m + j on L[m + i]. To assist in the construction of
the forcing lists, we use another array C[1..m] with C[i] = 0 if no implementation of
module i has been selected so far; C[i] = 1 if the first implementation of module i
has been selected; and C[i] = 2 if the second implementation has been selected.

First, we construct the forcing lists necessary to ensure the net span constraints.
For each net i for which a net span constraint is specified, identify the leftmost and
rightmost modules, in each module row, that contain net i (see Figure 3.2). There
are at most four such modules: leftmost module with net i in the top module row
(module u of Figure 3.2), leftmost in the bottom module row (w), rightmost in top
row (v) and rightmost in bottom row (x). The span of net i is determined by a pair
of these critical modules. One module in this pair is a leftmost critical module and
the other is a rightmost critical module. So, there are at most four module pairs to
consider (for the example of Figure 3.2, these four pairs are (u, v), (w, v), (u, x) and
(w, x)).

When a critical module pair is considered, let A denote the implementation of the
left module (of the pair) in which the leftmost pin of net i is to the right of the leftmost
pin of net i in the other implementation (ties are broken arbitrarily). Let A' denote
the other implementation of the left module. Let B denote the implementation of the
right module for which the rightmost pin of net i is to the left of the rightmost pin
of net i in the other implementation (ties are broken arbitrarily). Let B' denote the
other implementation of the right module. In the example of Figure 3.2, consider the
critical module pair (u, x): u is the left module and x is the right module. The second
implementation of u is A and its first implementation is A'; the first implementation
of x is B and its second implementation is B'. There are four ways in which we
can select the implementations of the modules u and x: (A, B), (A, B'), (A', B) and
(A', B'). For each of these four selections, we can determine the span of net i and
classify the selection as feasible (i.e., does not violate the net span constraint) or
infeasible. Notice that if the selection (A, B) violates the net span constraint for net
i, then each of the remaining three selection pairs also violates the net span constraint
for this net.

Figure 3.2. Critical modules of net i

We  have  the  following  possibilities: 

Case  1:  [No  selection  is  infeasible.]  All  four  selections  are  feasible.  In  this  case  no 
addition  is  made  to  the  forcing  lists. 


Case 2: [Exactly one selection is infeasible.] The infeasible selection must be (A', B')
and the other three selections are feasible. Now, the selection of A' forces us to
select B and the selection of B' forces us to select A. Therefore, we add B to
the forcing list for A' and A to that for B'. To add B to the forcing list of A'
(and similarly to add A to the list of B'), we first check C[] to determine if an
implementation for the module corresponding to A' has already been selected.
If no implementation has been selected, we simply append B to the list for
A'. If the implementation A has been selected, then we do nothing. If the
implementation A' has been selected, then the implementation B is forced and
we run the function Assign (L, C, B) of Figure 3.3 which selects implementation
B as well as other implementations that may now be forced. This function
returns the value False iff it has determined that no feasible solution exists.


Algorithm Boolean Assign (L[], C[], M)
/* Select implementation M and related modules */
    if M is selected then
        return True;
    if M' is selected then
        return False;
    /* M is undecided */
    Mark M selected in C[];
    for each X ∈ L[M] do
        if not Assign (L, C, X) then
            return False;
    end for
    Remove L[M] and L[M'];
    return True;

Figure 3.3. Function Assign

Case 3: [Exactly two selections are infeasible.] This can arise in one of two ways:
(a) (A, B) and (A, B') are feasible and (A', B') and (A', B) are infeasible, and
(b) (A, B) and (A', B) are feasible and (A', B') and (A, B') are infeasible. In
case (a), we must select implementation A. This is done by executing Assign
(L, C, A). In case (b), we must select implementation B; so, we perform Assign
(L, C, B).

Case 4: [Exactly three selections are infeasible.] Now (A, B) is the only feasible
selection and we perform Assign (L, C, A) and Assign (L, C, B).

Case 5: [All four selections are infeasible.] In this case, the 2-PDMIS instance has
no feasible solution.
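
The bookkeeping of Case 2 and the function Assign can be sketched in Python as follows. This is an illustrative sketch only: the 1-based lists with an unused index 0, the explicit parameter m, and the helper names other, status and add_forcing are assumptions introduced here. Implementation indices follow the text (k for the first implementation of module k, m + k for the second).

```python
def other(k, m):
    """Index of the other implementation of the same module."""
    return k + m if k <= m else k - m


def status(C, k, m):
    """Return 'selected', 'rejected' or 'undecided' for implementation k."""
    mod, want = (k, 1) if k <= m else (k - m, 2)
    if C[mod] == 0:
        return "undecided"
    return "selected" if C[mod] == want else "rejected"


def assign(L, C, M, m):
    """Recursive counterpart of Assign: select M and everything it forces.

    Returns False iff a contradiction (no feasible solution) is found.
    Recursion depth can reach the length of a forcing chain, so very long
    chains would need an iterative version.
    """
    state = status(C, M, m)
    if state == "selected":
        return True
    if state == "rejected":
        return False
    # M is undecided: record the selection, then follow its forcing list.
    C[M if M <= m else M - m] = 1 if M <= m else 2
    for X in list(L[M]):
        if not assign(L, C, X, m):
            return False
    # The module is now decided, so neither of its lists is needed again.
    L[M] = []
    L[other(M, m)] = []
    return True


def add_forcing(L, C, X, Y, m):
    """Record 'selecting X forces Y', following the three sub-cases of Case 2."""
    state = status(C, X, m)
    if state == "undecided":
        L[X].append(Y)              # remember the constraint for later
        return True
    if state == "selected":
        return assign(L, C, Y, m)   # X already chosen, so Y is forced now
    return True                     # the other implementation was chosen
```

For example, with m = 3, recording that implementation 1 of module 1 forces implementation 2 of module 2 is add_forcing(L, C, 1, 5, 3).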

Once  we  have  constructed  the  forcing  lists  for  the  net  span  constraints,  we  proceed 
to  augment  these  lists  to  account  for  the  channel  density  constraint.  Of  course,  this 
augmentation  is  to  be  done  only  when  we  haven't  already  determined  that  the  given 
2-PDMIS  is  infeasible.  Our  strategy  to  augment  the  forcing  lists  to  account  for  the 
density  constraint  begins  by  partitioning  the  routing  channel  into  regions  such  that 
no  module  boundary  falls  inside  of  a  region  (see  Figure  3.4). 

Figure 3.4. Partition a routing channel into regions

To ensure that the channel density is ≤ d, we require that the density in each
region of the channel be ≤ d. This can be done by examining each channel region.
Let T be the module on the top row of the channel region and B the module on
the bottom row. The density in this channel region is completely determined by the
nets that enter this region from its left or right and by the implementations of T
and B. Let T1, T2 (B1, B2) denote the two possible implementations of T (B). We
have four possible implementation pairs (T1, B1), (T1, B2), (T2, B1) and (T2, B2). We
can determine which of these four implementation pairs are infeasible (i.e., result in
a channel region density > d) and use a case analysis similar to that used above for
net span constraints. The cases are


Case  1:  [None  are  infeasible.]  Do  nothing. 

Case 2: [Exactly one is infeasible.] Suppose, for example, only (T1, B2) is infeasible.
We need to add B1 to the forcing list for T1 and T2 to the list for B2. This is
similar to case 2 for net span constraints.

Case 3: [Exactly two are infeasible.] This can happen in one of six ways. If the
feasible pairs are (T1, B2) and (T2, B1), then T1 forces B2, B2 forces T1, T2
forces B1 and B1 forces T2. The remaining five cases are similar.

Case 4: [Exactly three are infeasible.] There are four ways this can happen. For
example, if (T1, B1) is the only feasible pair, then implementations T1 and B1
must be selected. The remaining three cases are similar.

Case 5: [All four are infeasible.] The 2-PDMIS instance with density constraint d
has no feasible solution.
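
Both case analyses (net span and density) follow one pattern: an implementation that is infeasible with both implementations of the partner module can never be chosen, and each remaining infeasible pair contributes two forcing relations. The Python sketch below makes this explicit. The function name pair_constraints and its return convention are assumptions made for illustration; it only classifies, leaving the updates of L and C to the caller.

```python
def pair_constraints(P, P_alt, Q, Q_alt, infeasible):
    """Classify the four implementation pairs of two modules.

    P and P_alt are the two implementations of one module, Q and Q_alt those
    of the other. infeasible is a set of pairs (X, Y) with X in {P, P_alt}
    and Y in {Q, Q_alt}. Returns (ok, must_select, forcings), where forcings
    is a list of pairs (X, Y) meaning 'selecting X forces Y'.
    """
    p_side = {P: P_alt, P_alt: P}      # maps an implementation to its sibling
    q_side = {Q: Q_alt, Q_alt: Q}
    if len(infeasible) == 4:
        return False, [], []           # Case 5: no feasible selection

    # An implementation is blocked when both pairs containing it are infeasible.
    blocked_p = [X for X in p_side if all((X, Y) in infeasible for Y in q_side)]
    blocked_q = [Y for Y in q_side if all((X, Y) in infeasible for X in p_side)]
    must_select = ([p_side[X] for X in blocked_p] +
                   [q_side[Y] for Y in blocked_q])

    forcings = []
    for X, Y in sorted(infeasible):
        if X in blocked_p or Y in blocked_q:
            continue                   # already handled by a forced selection
        forcings.append((X, q_side[Y]))  # X chosen, so Y must be avoided
        forcings.append((Y, p_side[X]))  # Y chosen, so X must be avoided
    return True, must_select, forcings
```

With the single infeasible pair (T1, B2), the sketch reports that T1 forces B1 and B2 forces T2, which is Case 2 above; with only (T1, B1) feasible it reports that T1 and B1 must be selected, which is Case 4.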

3.3.2     Stage  2 

If, following stage 1, we have not determined that the 2-PDMIS instance is
infeasible, stage 2 is entered. If no nonempty forcing list remains, all implementations of
the modules for which no implementation has been selected result in feasible
solutions. When nonempty forcing lists remain, we use the limited branching method [6]
to make the remaining module implementation selections. In this method, we start
with a module i whose implementation is yet to be selected. For this module, we try
out both implementations, in parallel, following the forcing lists L[i] and L[m + i],
respectively. This is equivalent to running Assign (L, C, i) and Assign (L, C, m + i) in
parallel and terminating when either (a) both return with value False or (b) one (or
both) return with value True. When (a) occurs, the instance has no feasible solution. When
(b) occurs, the selections made by the branch that returns True are used. Note that
the parallel execution of Assign (L, C, i) and Assign (L, C, m + i) is actually done via
simulation by a single processor; this processor alternates between performing one
step of Assign (L, C, i) and one of Assign (L, C, m + i) and stops when one of the two
conditions (a) or (b) occurs. In case of (b), we proceed with the next module with
unselected implementation.
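
The alternation between the two branches can be sketched with Python generators, each generator performing one propagation step per resumption. This is a simplified illustration rather than the code of Figures 3.5 to 3.7: each branch keeps its tentative selections in a private dictionary instead of a second copy of C, and the names branch, step and limited_branching are assumptions made here.

```python
from collections import deque


def branch(L, C, start, m):
    """Propagate the forcing lists from implementation 'start', one step per
    resumption. Returns the tentative selections, or None on contradiction."""
    delta = {}                          # module -> 1 or 2, local to this branch
    queue = deque([start])
    while queue:
        k = queue.popleft()
        mod, want = (k, 1) if k <= m else (k - m, 2)
        current = C[mod] or delta.get(mod, 0)
        if current == 0:
            delta[mod] = want
            queue.extend(L[k])          # everything k forces must be checked
        elif current != want:
            return None                 # the other implementation is fixed
        yield                           # hand control to the other branch
    return delta


def step(gen):
    """Advance a branch by one step; returns (finished, result)."""
    try:
        next(gen)
        return False, None
    except StopIteration as stop:
        return True, stop.value


def limited_branching(L, C, m):
    """Decide every undecided module; returns False iff no selection exists."""
    for i in range(1, m + 1):
        if C[i] != 0:
            continue
        gens = [branch(L, C, i, m), branch(L, C, m + i, m)]
        failed = [False, False]
        winner = None
        while winner is None:
            for b in (0, 1):            # alternate one step of each branch
                if failed[b]:
                    continue
                finished, result = step(gens[b])
                if finished and result is not None:
                    winner = result
                    break
                if finished:
                    failed[b] = True
            if winner is None and all(failed):
                return False
        for mod, impl in winner.items():
            C[mod] = impl               # commit the successful branch only
    return True
```

Because the losing branch never writes to C, no undo step is needed in this variant; the price is a dictionary lookup per step instead of a direct array access.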

3.3.3    Implementation  Details 

To implement stage 2, we need two copies of the implementation selection array C;
one copy for each parallel execution branch. Call these copies C1 and C2. Although
both are identical at the start of Assign (L, C1, i) and Assign (L, C2, m + i), C1 and C2
may differ later. When the execution of these two branches terminates, we need to set
the copy of C corresponding to the unselected branch equal to that of the selected branch.
This is done efficiently by maintaining two lists Δ1 and Δ2 of changes made to C1
and C2 since the start of the two branches. Then, if C1 is selected, we can use Δ2 to
first convert C2 back to its initial state and then use Δ1 to convert it from the initial
state to C1. If C2 is selected, a similar process can be used to convert C1 to C2. The
time needed for this is |Δ1| + |Δ2| rather than |C1| = |C2| = m (as would be the case
if we simply copy C1 to C2 or C2 to C1).
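
As a small illustration of this synchronization, the following Python sketch undoes the losing branch's changes and replays the winning branch's changes. The function name sync_copies and the choice to store module indices (rather than implementation indices) in the change lists are assumptions made for this example.

```python
def sync_copies(c_lose, delta_lose, c_win, delta_win):
    """Make c_lose equal to c_win in O(len(delta_lose) + len(delta_win)) time.

    delta_lose and delta_win list the modules changed in each copy since the
    two copies were last identical.
    """
    for mod in delta_lose:
        c_lose[mod] = 0              # back to the common initial state
    for mod in delta_win:
        c_lose[mod] = c_win[mod]     # replay the winner's selections


# Example: both copies start identical, then the branches diverge.
c1 = [0, 1, 0, 0, 0]
c2 = list(c1)
c1[2], c1[3] = 1, 2                  # changes made by branch 1
c2[2], c2[4] = 2, 1                  # changes made by branch 2
sync_copies(c2, [2, 4], c1, [2, 3])  # branch 1 was selected
```

Only the entries named in the two change lists are touched, which is what keeps the cost independent of m.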

Further, since the forcing lists are shared by two branches, these branches should
not modify the forcing lists. Therefore the simulation of Assign omits the steps that
remove forcing lists. Finally, to efficiently simulate two parallel executions of Assign,
we need to convert the recursive version of Figure 3.3 into an iterative version. Our
iterative code which simulates the parallel execution of two Assign branches employs
two queues Q1 and Q2. A high level description of the code is given in Figures 3.5,
3.6 and 3.7.

3.3.4     Time  Complexity 

To construct the net span constraints' portion of the forcing lists, we must identify
the up to four critical modules of each net and establish the forcing constraints for
each of the up to four critical module pairs that determine the net span. The critical
modules for all nets can be determined in O(p) time by making a left to right sweep
of the modules, keeping track, for each net i, of the first and last modules in the
top and bottom module row that contain net i. Since all pin locations and module
boundaries are integers, the modules can be sorted in left to right order in linear time
using bin sort [24]. Each net's contribution to the forcing lists can now be determined
in O(1) time. Therefore, representing each L[i] as a chain, the net span constraints'
contribution to the L[i]s can be determined in O(p + n) = O(p) time.

To construct the portion of L[i] that results from the channel density constraint,
we partition the channel into regions by performing a left to right sweep of the
modules and using the module end points as region boundaries. The number of
channel regions is, therefore, O(m). In our implementation, we scan the channel
four times to compute the maximum density of each region for each of the four
possible implementations of the module pair that bounds the region. This takes O(p)
time. Once we have the densities of each region we can, given the density constraint,

Algorithm Boolean Satisfy (L[], C2[])
/* Test whether L is satisfiable */
Copy C2[] into C1[];
for i = 1 to m do
    if C1[i] == 0 then /* i is undecided */
        if L[i] is empty then
            C1[i] = C2[i] = 1; /* select first implementation */
        else if L[m + i] is empty then
            C1[i] = C2[i] = 2; /* select second implementation */
        else
            EnQueue (Q1, i);
            EnQueue (Q2, m + i); /* m + i represents the 2nd implementation of module i */
            while Q1 not empty and Q2 not empty do
                a = DeQueue (Q1);
                b = DeQueue (Q2);
                if a is rejected in C1 and b is rejected in C2 then
                    return False;
                else if a is rejected in C1 then
                    EnQueue (Q1, a);
                    if not Search (L, Q2, C2, Δ2, b) then
                        return False;
                else if b is rejected in C2 then
                    EnQueue (Q2, b);
                    if not Search (L, Q1, C1, Δ1, a) then
                        return False;
                else
                    if a is undecided in C1 then
                        Add List L[a] into Q1;
                        Insert a into Δ1;
                        Mark a selected in C1;
                    if b is undecided in C2 then
                        Add List L[b] into Q2;
                        Insert b into Δ2;
                        Mark b selected in C2;
            end while /* Q1 not empty and Q2 not empty */
            if Q1 is empty then
                Undo (C2, Δ2, C1, Δ1); /* make C2 = C1 */
            else /* Q2 is empty */
                Undo (C1, Δ1, C2, Δ2); /* make C1 = C2 */
        end if /* L[i] is empty */
    end if /* module i is undecided */
end for
return True;

Figure 3.5. Function Satisfy

Algorithm Boolean Search (L, Q, C, Δ, x)
/* Select module x and modules in Q and related modules, update list Δ */
Mark x selected in C;
Insert x into Δ;
Add List L[x] into Q;
while Q not empty do
    y = DeQueue (Q);
    if y is rejected in C then
        return False;
    else if y is undecided in C then
        Add List L[y] into Q;
        Insert y into Δ;
        Mark y selected in C;
    /* else y is already selected in C; nothing to do */
end while
return True;

Figure 3.6. Function Search


Algorithm Undo (C1, Δ1, C2, Δ2)
/* make C1 = C2 by using delta lists */
for each x ∈ Δ1 do
    Mark x undecided in C1;
for each x ∈ Δ2 do
    Mark x selected in C1;

Figure 3.7. Procedure Undo

construct the forcing lists L[1..2m] in Θ(m) time. Notice that on succeeding iterations
of the binary search for an optimal solution, only the contribution to L[] from the
density constraint may change. The new contribution to L[] can be determined
without recomputing the densities of each region.

The limited branching method of stage 2 uses two queues Q1 and Q2. The time
needed to add (EnQueue) or delete (DeQueue) an element to/from a queue is Θ(1)
[24]. In each iteration of the for loop of Figure 3.5, the time spent following the
successful branch equals that spent following the unsuccessful branch and the time
needed to make C1 and C2 identical (i.e., the cost of the Undo operation) is,
asymptotically, no more than the time spent following the successful branch. The time
spent following all successful branches is no more than the size of the forcing lists
because no forcing list is examined twice. Therefore, the stage 2 time is O(p).

The binary search for the minimum density solution iterates O(log n) times.
Therefore, our algorithm finds an optimal solution to the 2-PDMIS problem in
O(p log n) time.
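
The outer binary search can be sketched as follows. The callback is_feasible(d) is a hypothetical stand-in for one run of stages 1 and 2 with density constraint d; the sketch assumes only that feasibility is monotone in d (a selection that meets density d also meets any larger bound).

```python
def min_feasible_density(n, is_feasible):
    """Least d in [1, n] with is_feasible(d) true, or None if there is none.

    is_feasible is assumed to be monotone: once true, it stays true for all
    larger d. It is called O(log n) times.
    """
    if not is_feasible(n):
        return None            # e.g. the net span constraints cannot be met
    lo, hi = 1, n
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid           # mid works, so the optimum is at most mid
        else:
            lo = mid + 1       # mid fails, so the optimum is larger
    return lo
```

The initial test at d = n separates instances that are infeasible for every density (for example, because of the net span constraints) from those that merely need a large density.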

Comparing our algorithm to that of Her et al. [11], we note that our algorithm
has the potential of identifying infeasible 2-PDMIS instances quite early; that is,
during the construction of the forcing lists. Although infeasibility resulting from
the critical modules of a single net being too far apart is detected immediately by
both algorithms, our algorithm can also quickly detect infeasibility resulting from
forced selections during stage 1. The algorithm of Her et al. [11] does not do this.
Because of the calls to Assign made during stage 1, the size of the forcing lists to be
processed in stage 2 is often significantly reduced. As a result, the limited branching
operation is often applied to much smaller data sets than the 2-SAT graph on which
the strongly connected component algorithm is applied [11]. These factors contribute
to the observed speedup provided by our algorithm relative to that of Her et al. [11].

3.4     Multichannel  2-PDMIS  Problem 

In  the  multichannel  2-PDMIS  problem,  we  have  c  +  1,  c  >  1  rows  of  modules. 
Each  module  has  pins  on  its  upper  and  lower  boundaries,  each  module  has  two 
possible  implementations,  there  is  a  routing  channel  between  every  pair  of  adjacent 
rows,  and  net  span  bounds  are  provided  for  every  channel  [11].  Although  Her  et  al. 
[11]  develop  a  heuristic  for  the  general  multichannel  PDMIS  problem,  they  do  not 
consider  polynomial  time  algorithms  for  the  multichannel  2-PDMIS  problem. 

For any fixed channel density tuple (d1, d2, ..., dc) for the c routing channels,
we can develop the forcing lists in O(p) time, where p is the total number of pins.
These lists are developed using ideas similar to those used in Section 3.3. Then,
using the limited branching method of Section 3.3, we can determine, in O(p) time,
whether it is possible to select module implementations so that the channel densities
do not exceed (d1, d2, ..., dc) and so that the net span bounds are satisfied. Thus,
the method of Section 3.3 is easily extended to obtain an O(p) feasibility test for
(d1, d2, ..., dc). Since there are O(n^c) possible density vectors (n is the number of
nets), the c channel 2-PDMIS problem can be solved by trying out all O(n^c) tuples
in O(pn^c) time.

We can reduce this time to O(pn^(c-1)) as follows. When c = 2, first determine the
least y such that (n/2, y) is a feasible channel density tuple. This is done using a binary
search on d2 and takes O(log n) feasibility tests, each test taking O(p) time. We can
ignore tuples (d1, d2) with d1 < n/2 and d2 < y because these tuples are infeasible,
and we can ignore tuples (d1, d2) with d1 > n/2 and d2 > y because these are inferior
to (n/2, y). Therefore, the search for a better tuple than (n/2, y) may be limited to the
regions d1 < n/2 and d2 > y, and d1 > n/2 and d2 < y. These two regions (Figure 3.8)
may now be searched recursively. For example, to find the best tuple in the region
d1 < n/2 and d2 > y, find the least z such that (n/4, z) is feasible. Now search the two
regions d1 < n/4 and d2 > z, and d1 > n/4 and d2 < z, for a better tuple than (n/4, z).
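
A divide-and-conquer search of this kind can be sketched in Python. Since the measure by which one tuple is "better" than another is not restated here, the sketch makes an assumption: it returns every feasible tuple that is not dominated by another feasible tuple, from which the best tuple under any monotone objective can then be picked. The callback is_feasible(d1, d2) is a hypothetical stand-in for the O(p) feasibility test and is assumed to be monotone in both arguments. The left-hand region is searched with d2 ≥ y rather than d2 > y so that ties in d2 are resolved by a final filtering pass.

```python
def pareto_tuples(n, is_feasible):
    """Non-dominated feasible (d1, d2) tuples with 1 <= d1, d2 <= n."""
    found = []

    def least_d2(d1, lo, hi):
        """Least d2 in [lo, hi] with is_feasible(d1, d2), or None."""
        if not is_feasible(d1, hi):
            return None
        while lo < hi:
            mid = (lo + hi) // 2
            if is_feasible(d1, mid):
                hi = mid
            else:
                lo = mid + 1
        return lo

    def search(lo1, hi1, lo2, hi2):
        if lo1 > hi1 or lo2 > hi2:
            return
        mid = (lo1 + hi1) // 2
        y = least_d2(mid, lo2, hi2)
        if y is None:
            # Nothing feasible at d1 = mid in this box, hence nothing at
            # smaller d1 either; only larger d1 values remain.
            search(mid + 1, hi1, lo2, hi2)
            return
        found.append((mid, y))
        search(lo1, mid - 1, y, hi2)      # smaller d1 needs d2 >= y
        search(mid + 1, hi1, lo2, y - 1)  # larger d1 only helps if d2 < y

    search(1, n, 1, n)
    found.sort()
    result = []
    for d1, d2 in found:
        # Keep a tuple only if it strictly improves d2 over all smaller d1.
        if not result or d2 < result[-1][1]:
            result.append((d1, d2))
    return result
```

Each level of the recursion halves the d1 range and splits the d2 range between its two children, which is the structure behind the recurrence analyzed next.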

The worst-case number of feasibility tests for the above search strategy is given
by the recurrence

N(n) = 2N(n/2) + log n,    n ≥ 2

and N(1) = 1. The solution to this recurrence is N(n) = O(n). Since each feasibility
test takes O(p) time, the 2-channel 2-PDMIS problem can be solved in O(pn) time.
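
For completeness, the bound N(n) = O(n) can be checked by unrolling the recurrence. The derivation below assumes, for simplicity, that n = 2^k and that logarithms are base 2.

```latex
\begin{aligned}
N(n) &= \sum_{i=0}^{k-1} 2^{i}\,\log\frac{n}{2^{i}} \;+\; 2^{k}\,N(1)
      = \sum_{i=0}^{k-1} 2^{i}\,(k-i) \;+\; n \\
     &= \sum_{j=1}^{k} j\,2^{\,k-j} \;+\; n
      = \bigl(2^{k+1} - k - 2\bigr) + n \\
     &= 3n - \log n - 2 \;=\; O(n).
\end{aligned}
```

As a check, N(2) = 2N(1) + 1 = 3 and N(4) = 2N(2) + 2 = 8, which agree with 3n - log n - 2 for n = 2 and n = 4.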

By doing an exhaustive search on the densities of c - 2 channels and using the
above technique for the remaining 2 channels (i.e., for each choice of densities for
c - 2 channels, find the overall best choice for the c channels as above), we can solve
the c-channel 2-PDMIS problem in O(p · n^(c-2) · n) = O(pn^(c-1)) time.

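[Figure: d1 on the horizontal axis (marked at n/2) and d2 on the vertical axis (marked at y and n). The region d1 < n/2, d2 < y is labeled "infeasible"; the region d1 > n/2, d2 > y is labeled "eliminate"; the regions d1 < n/2, d2 > y and d1 > n/2, d2 < y remain to be searched.]

Figure 3.8. The two regions to be searched recursively after the binary search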


3.5    Experimental  Results 

We implemented our algorithm as well as that of Her et al. [11] in C and measured
the run time performance of the two algorithms on a SUN SPARCstation 5. Our first
data set consists of benchmark channels used in Her et al. [11]. We partitioned the
top row and bottom row of each channel into intervals, treated these intervals
as "modules", and assumed that each module has two implementations. Table 3.1 gives
the characteristics of these circuits as well as the time, in seconds, taken by the two
algorithms. The optimal densities given in Table 3.1 differ from those reported in [11]
because the partitioning of the top and bottom rows of pins used by us is different
from that used in Her et al. [11]. The speedup provided by our algorithm ranges
from 1.67 to 2.20. Our second data set consists of circuits designed to minimize the
size of the forcing lists constructed in stage 1. The characteristics of these circuits
as well as the performance of the two algorithms on these circuits are given in
Table 3.2. Our algorithm is between 9.24 and 11.72 times as fast on these circuits.

Table 3.1. Running time for benchmark channels

                           Optimal    Time (seconds)
Channel     n     m     p  density    [11]      Ours      Speedup
ex1        21    19    74    12       0.0022    0.0010    2.20
ex3a       44    36   158    14       0.0046    0.0023    2.00
ex3b       47    24   158    16       0.0035    0.0021    1.67
ex3c       54    23   178    18       0.0039    0.0023    1.70
ex4b       54    28   192    17       0.0045    0.0024    1.88
ex5        64    40   190    18       0.0042    0.0025    1.68

Table 3.2. Running time for generated channels

                                     Time (seconds)
Channel          n       m       p   [11]       Ours      Speedup
w32x32          64      66     192    0.0425    0.0046     9.24
w64x64         128     130     384    0.0999    0.0105     9.51
w128x128       256     258     768    0.2275    0.0225    10.11
w256x256       512     514    1536    0.5130    0.0487    10.53
w512x512      1024    1026    3072    1.1755    0.1066    11.03
w1024x1024    2048    2050    6144    2.6150    0.2309    11.33
w2048x2048    4096    4098   12288    5.6700    0.4886    11.60
w4096x4096    8192    8194   24576   12.0500    1.0280    11.72
w8192x8192   16384   16386   49152   24.8800    2.1260    11.70



3.6     Conclusion 

We have developed an O(p log n) time algorithm for the single channel 2-PDMIS
problem and an O(pn^(c-1)) time algorithm for the c, c > 1, channel 2-PDMIS problem.
Experiments indicate that our single channel algorithm is substantially faster than
the single channel algorithm of Her et al. [11]. The heuristic proposed in Her et al. [11] for the
k-PDMIS problem, k > 2, uses the algorithm for the 2-PDMIS problem. By using
our 2-PDMIS algorithm, the k-PDMIS heuristic of Her et al. [11] will also run faster.


CHAPTER  4 
GATE  RESIZING  TO  REDUCE  POWER  CONSUMPTION 


4.1     Introduction 

Power consumption, speed and area are three important and related characteristics
of a circuit. With the increase in circuit density and the enhanced use of battery
operated devices, the emphasis on power consumption has increased. By reducing
power consumption, we simultaneously reduce heat dissipation and increase battery
life.

In this chapter we consider the problem of minimizing the power consumed by
a circuit subject to satisfying the circuit's timing constraints. Power reduction is
obtained by gate resizing: larger gates are replaced by smaller ones that have higher
delay but lower power consumption. Power reduction via gate resizing has been
considered previously [3].

In the general gate resizing problem (GGR), for each gate in the circuit we have a
list  of  (delay,  capacitance)  pairs.  Each  pair  gives  the  delay  and  capacitance  associated 
with  a  possible  implementation  of  that  gate.  Since  the  power  consumed  by  a  gate  is 
linearly  proportional  to  the  product  of  its  capacitance  and  the  switching  activity  at 
its  inputs,  the  gate's  power  consumption  can  be  computed  from  its  capacitance  once 
the  circuit  characteristics  are  known.  Therefore,  we  assume  that  instead  of  (delay, 
capacitance)  pairs,  we  have  (delay,  power  consumption)  pairs.  In  this  model  we  ignore 
the change in power from load change and switching activity change due to change of
gate delay. The same assumption has been used in Chen and Sarrafzadeh [3]. In the
GGR  problem,  we  begin  with  a  realization  for  each  gate  (i.e.,  a  selection  of  a  (delay, 
power  consumption)  pair)  such  that  the  timing  constraints  are  satisfied.  We  wish  to 
change  the  realization  of  some  or  all  of  the  gates  by  replacing  their  assigned  pair  with 
one  that  has  larger  delay  (i.e.,  gate  resizing)  and  such  that  the  timing  constraints 
remain  satisfied  and  the  power  consumption  of  the  resized  circuit  (this  is  the  sum  of 
the  power  consumption  at  each  gate)  is  minimum.  The  GGR  problem  (referred  to 
as  the  incomplete  library  problem  [3])  is  equivalent  to  the  BCI  problem  studied  in  Li 
et  al.  [20].  The  BCI  problem  was  shown  to  be  NP-Complete  [20],  even  for  circuits 
that  were  simply  a  chain  of  single  input  single  output  gates.  Bahar  et  al.  [1]  propose 
a  greedy  heuristic  for  the  GGR  problem.  This  heuristic  resizes  one  gate  at  a  time. 
Chen  and  Sarrafzadeh  [3]  have  proposed  a  heuristic  that  resizes  several  gates  at  a  time. 
Experimental  results  presented  by  Chen  and  Sarrafzadeh  show  that  their  heuristic 
is  able  to  reduce  the  power  consumed  by  benchmark  circuits  by  approximately  10%. 
More  precisely,  the  method  of  Chen  and  Sarrafzadeh  [3]  did  worse  than  the  greedy 
method  of  Bahar  et  al.  [1]  by  5.3%  on  one  of  9  benchmark  circuits  and  did  better  by 
1.4%  to  20.6%  on  the  remaining  8  circuits. 

Chen and Sarrafzadeh also propose a pseudo polynomial time algorithm for the
Low Power Complete Library-Specific Gate Resizing (CGR) problem. In this problem,
each gate can be realized to have any delay (delays are assumed to be integral).
Further, the power consumed by gate v decreases by the constant c(v) for each unit
increase in delay. Let d_i(v) be the delay of the initially assigned realization of gate
v and let d(v) ≥ d_i(v) be the delay of the resized gate v. Then the power reduction
ΔP resulting from resizing an n gate circuit is

$$\Delta P = \sum_{v} c(v)\,\bigl(d(v) - d_i(v)\bigr)$$

In Section 4.2.2 we develop a linear algorithm for the CGR problem for series-parallel
circuits. In Section 4.2.3 we extend this linear algorithm for series-parallel
graphs to obtain an $O(n \log^2 n)$ time algorithm that works when there is an upper
bound on the delay of each gate. That is, each gate v has realizations with integral
delays in the range $[d_l(v), d_u(v)]$. As in the CGR problem, each unit increase in delay
reduces power consumption by c(v). We call this the CUGR problem. The CUGR
problem for tree circuits can also be solved in linear time (Section 4.3). In Section 4.4,
we show that the CGR problem is NP-hard for circuits that have a special type
of multi-input multi-output gate. An alternative algorithm for the CGR problem is
presented in Section 4.5. This algorithm transforms the CGR problem to an
activity-on-edge network [15] and then uses a known method to minimize project cost [8] to
obtain an optimal solution to the CGR problem. The approach of Section 4.5 is
quite general and can be used even for the CUGR problem and when the c(v)s
are convex functions rather than constants. In Section 4.5 we also point out that
the CGR algorithm of Chen and Sarrafzadeh does not work for CUGR and convex
c(v)s. In Section 4.6 we use the approach of Section 4.5 to obtain a heuristic for the
GGR problem with convex c(v)s. Experimental results comparing our algorithms of
Sections 4.5 and 4.6 and the CGR algorithms of Chen and Sarrafzadeh [3] are presented in
Section 4.7. Although both CGR algorithms generate minimum power circuits, our
algorithm does this using significantly less time. The GGR heuristic we developed
obtains better power reduction in many circuits than the one developed in Chen and
Sarrafzadeh [3].

Throughout  this  chapter  we  assume  that  a  circuit  is  represented  as  a  directed 
acyclic  graph.  The  vertices  of  this  graph  represent  gates  and  the  edges  represent 
signal  flow.  Primary  inputs  may  be  modeled  as  vertices  with  no  incoming  edge  and 
primary  outputs  may  be  modeled  as  vertices  with  no  outgoing  edge. 

Figure  4.1  gives  the  digraph  for  an  example  circuit.  The  vertices  corresponding 
to  primary  inputs  are  labeled  with  the  time  at  which  the  primary  input  is  available; 
vertices  corresponding  to  primary  outputs  are  labeled  with  the  time  by  which  the 
output  signal  must  arrive;  and  the  remaining  vertices  (these  correspond  to  circuit 
gates)  are  labeled  with  the  (delay,  power  consumption)  pair  corresponding  to  their 
initial  implementation. 

Figure 4.1. Digraph corresponding to a circuit. (Vertex labels recoverable from the
original figure: (1,12), (2,13), (2,15), (3,21).)

4.2     Series-Parallel  Circuits 

4.2.1     Definition 

Series-parallel circuits were considered in Li et al. [20]. A series-parallel circuit
may be defined recursively as follows [20]:

SP1:  a  chain  of  gates  is  a  series-parallel  circuit  (Figure  4.2(a)). 

SP2: several chains of gates joined at the ends to a common first gate and a common
last gate (Figure 4.2(b)) define a simple parallel circuit. A simple parallel circuit
is a series-parallel circuit.

SP3:  a  circuit  obtained  from  a  series-parallel  circuit  C  by  replacing  any  interconnect 
of  C  by  another  series-parallel  circuit  is  a  series-parallel  circuit  (Figure  4.2(c)). 

Figure  4.2  gives  example  series-parallel  circuits  as  well  as  a  circuit  that  is  not 
series-parallel. 

Figure 4.2. Circuit Examples (Source: Li et al. [20]). (a) Chain; (b) Simple Parallel
Circuit; (c) Series-Parallel Circuit; (d) Non-Series-Parallel Circuit

4.2.2     Complete Library Gate Resizing (CGR)

Our  strategy  to  solve  the  CGR  problem  for  series-parallel  circuits  is  to  reduce  the 
circuit  to  one  that  has  a  single  gate.  The  CGR  problem  for  the  reduced  single  gate 
circuit  is  easily  solved,  and  finally  the  solution  to  this  single  gate  problem  is  used  to 
reconstruct  the  solution  for  the  initial  circuit. 

To transform an arbitrary series-parallel circuit into an equivalent single gate
circuit (i.e., a single gate circuit with the same maximum power reduction), we first
obtain the series-parallel decomposition of the circuit using the linear time algorithm
of [33]. This series-parallel decomposition essentially tells us how to build the original
circuit using chains (SP1), simple parallel circuits (SP2), and the replacement of
interconnects by series-parallel circuits (SP3). During this rebuild process, we shall replace
each chain and simple parallel circuit by a single gate. Consequently, when the rebuild
is complete, we will be left with a single gate. The replacement rules for chains and
simple parallel circuits are given below:

Chain: Suppose the chain has n gates labeled with delays $d_1, d_2, \ldots, d_n$. Let $c_i$ be the
power reduction obtained by increasing $d_i$ by 1. The signal delay through the
chain is $\sum_{i=1}^{n} d_i$. For each unit increase in signal delay over $\sum_{i=1}^{n} d_i$, the
maximum possible power reduction is $\max_{1 \le i \le n}\{c_i\}$. Therefore the chain is
equivalent to a gate v with delay $\sum_{i=1}^{n} d_i$ and $c(v) = \max_{1 \le i \le n}\{c_i\}$ (Figure 4.3(a)).
The power reduction $\Delta P$ obtained by making this replacement is zero.

Simple Parallel Circuit: First transform each chain in the simple parallel circuit into
an equivalent single gate using the transformation of Figure 4.3(a). This results
in the parallel circuit of Figure 4.3(b). The signal delay between the output of
gate s and the input of gate t is $\max_{1 \le j \le n}\{d_j\}$. We can increase the delays of
all gates between s and t to $\max_{1 \le j \le n}\{d_j\}$ without affecting the overall circuit
delay. This gives us a power reduction $\Delta P = \sum_{i=1}^{n} c_i\,(\max_{1 \le j \le n}\{d_j\} - d_i)$. Each
unit increase in delay between s and t beyond $\max_{1 \le j \le n}\{d_j\}$ gives us a power
reduction of $\sum_{i=1}^{n} c_i$. Therefore the n gates between s and t are equivalent
to a single gate with delay $\max_{1 \le j \le n}\{d_j\}$ and $c(v) = \sum_{i=1}^{n} c_i$. Consequently,
a simple parallel circuit may be replaced by the three gate chain shown in
Figure 4.3(b). This chain can, in turn, be replaced by a single gate using the
chain transformation of Figure 4.3(a).
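
The two replacement rules can be sketched in a few lines of Python. This is only an illustration under our own assumptions: a gate is modeled as a (delay, c) tuple, the function names `reduce_chain` and `reduce_parallel` are hypothetical, and the numeric values are made up rather than taken from the figures.

```python
from typing import List, Tuple

Gate = Tuple[int, int]  # (delay d, power reduction c per unit of extra delay)

def reduce_chain(gates: List[Gate]) -> Gate:
    """Chain rule: delays add up, and the best per-unit saving is the largest c."""
    return (sum(d for d, _ in gates), max(c for _, c in gates))

def reduce_parallel(gates: List[Gate]) -> Tuple[Gate, int]:
    """Parallel rule for the gates between s and t.

    Returns the equivalent gate together with the power reduction gained by
    raising every branch delay to the largest branch delay.
    """
    d_max = max(d for d, _ in gates)
    delta_p = sum(c * (d_max - d) for d, c in gates)
    return (d_max, sum(c for _, c in gates)), delta_p

# Hypothetical example values.
chain_gate = reduce_chain([(2, 5), (3, 7), (1, 4)])
par_gate, gained = reduce_parallel([(4, 2), (6, 3), (1, 5)])
print(chain_gate)        # (6, 7)
print(par_gate, gained)  # (6, 10) 29
```

In the parallel example the branch with delay 1 is stretched by 5 units and the branch with delay 4 by 2 units, which accounts for the 5*5 + 2*2 = 29 units of power reduction.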

Using the above transformations on the series-parallel decomposition yields a
single gate circuit. The input to this gate is the primary input of the original circuit
and the gate output is the primary output of the original circuit. Let v be the single
gate that remains. Let $t_i$ and $t_o$ be the arrival time of the input and the required time
for the output, respectively. Since we start with a circuit that can meet its arrival
time requirements (i.e., a feasible circuit) and since the transformations of Figure 4.3
do not affect feasibility, $t_i + d(v) \le t_o$. The additional power reduction possible is
$(t_o - t_i - d(v))\,c(v)$. The maximum power reduction $\Delta P_{max}$ for the original circuit is
$(t_o - t_i - d(v))\,c(v)$ plus the sum of the $\Delta P$s from the simple parallel circuit
transformations (Figure 4.3(b)).

Figure 4.3. Transformation of Series-Parallel Circuits. (a) Chain: $\Delta P = 0$, $c = \max(c_i)$;
(b) Simple Parallel Circuit: $\Delta P = \sum_{i=1}^{n} c_i\,(\max_j(d_j) - d_i)$

To  obtain  the  delay  values  for  each  gate  of  the  original  circuit  that  will  result  in  a 
power  reduction  of  APmax,  simply  follow  the  reduction  process  backwards.  The  total 
time  taken  is  linear  in  the  number  of  gates  in  the  original  circuit.  Figures  4.4  and 
4.5  show  an  example.  Each  gate  is  represented  by  a  box,  the  number  inside  a  box 
is  the  gate  delay,  the  number  below  a  box  is  the  gate's  c  value,  the  primary  input  is 
available  at  time  0,  and  the  primary  output  is  needed  at  time  37. 
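
Following the reduction backwards can be sketched as a recursive assignment over the series-parallel decomposition. The code below is a minimal illustration, not the linear-time implementation: the nested-tuple circuit encoding, the function names `reduced` and `assign`, and the example circuit are all assumptions of this sketch, and `reduced` is recomputed rather than cached, so the sketch is quadratic in the worst case.

```python
def reduced(node):
    """Return (delay, c) of the single gate equivalent to node."""
    kind = node[0]
    if kind == "gate":                      # ("gate", name, delay, c)
        return node[2], node[3]
    parts = [reduced(child) for child in node[1]]
    if kind == "series":                    # chain rule
        return sum(d for d, _ in parts), max(c for _, c in parts)
    return max(d for d, _ in parts), sum(c for _, c in parts)  # parallel rule

def assign(node, target, out):
    """Stretch node so that its end-to-end delay becomes target."""
    kind = node[0]
    if kind == "gate":
        out[node[1]] = target
    elif kind == "parallel":                # every branch is stretched to target
        for child in node[1]:
            assign(child, target, out)
    else:                                   # series: spare delay goes to the largest c
        parts = [reduced(child) for child in node[1]]
        spare = target - sum(d for d, _ in parts)
        best = max(range(len(parts)), key=lambda i: parts[i][1])
        for i, child in enumerate(node[1]):
            assign(child, parts[i][0] + (spare if i == best else 0), out)

# Hypothetical circuit: s, then (a in parallel with the chain b-c), then t.
circuit = ("series", [
    ("gate", "s", 1, 2),
    ("parallel", [
        ("gate", "a", 2, 3),
        ("series", [("gate", "b", 1, 4), ("gate", "c", 2, 1)]),
    ]),
    ("gate", "t", 1, 6),
])
old = {"s": 1, "a": 2, "b": 1, "c": 2, "t": 1}
cost = {"s": 2, "a": 3, "b": 4, "c": 1, "t": 6}
t_in, t_out = 0, 10

new = {}
assign(circuit, t_out - t_in, new)
saved = sum(cost[g] * (new[g] - old[g]) for g in new)
print(new)    # {'s': 1, 'a': 8, 'b': 6, 'c': 2, 't': 1}
print(saved)  # 38
```

The result agrees with the closed form: the parallel transformation contributes 3 * (3 - 2) = 3, and the single remaining gate has delay 5 and c = 7, contributing (10 - 0 - 5) * 7 = 35.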

Figure 4.4. Transformation of a Series-Parallel Circuit into a single gate. (Annotations
recoverable from the original figure: (b) $\Delta P_2 = 6 \cdot 5 + 2 \cdot 2 + 3 \cdot 3 = 43$;
(d) $\Delta P_4 = 8 \cdot 15 = 120$; (e) $\Delta P_5 = 21 \cdot \Delta d = 168$.)

Figure 4.5. Computation of new delay for each gate.

4.2.3     Complete Library with Upper Bounds (CUGR) and Convex c(v)s

The power reduction per unit delay increase function c(v) for gate v is convex iff
there exist positive $\delta_1, \delta_2, \ldots, \delta_m$ and $c_1 > c_2 > \cdots > c_m$ such that $c(v) = c_1$ for delay
increases between 0 and $\delta_1$; $c(v) = c_2$ for delay increases between $\delta_1 + 1$ and $\delta_1 + \delta_2$;
$c(v) = c_3$ for delay increases between $\delta_1 + \delta_2 + 1$ and $\delta_1 + \delta_2 + \delta_3$; and so on. Figure 4.6
shows the power consumption as a function of the increase in delay relative to the
gate's initial delay $d_l(v)$. $P_0$ is the power consumption when the gate has its initial
delay $d_l(v)$.

Figure 4.6. Convex delay-power-consumption graph. (Power consumption versus delay
increase: a piecewise linear curve with slopes $-c_1, -c_2, \ldots, -c_m$ and breakpoints at
$\delta_1$, $\delta_1 + \delta_2$, ..., $\sum_i \delta_i$.)


The CUGR problem can be modeled using gates with convex power reduction
functions c(v). For example if gate v provides a power reduction of c for each unit
increase in delay between $d_l(v)$ and $d_u(v)$ then we may use c(v) with $\delta_1 = d_u(v) - d_l(v)$
and $c_1 = c$. Because of this correspondence between the CUGR problem and the
convex gate resizing problem ConvexCGR, we consider only the ConvexCGR problem
in this section.



To solve the ConvexCGR problem for series-parallel circuits, we need only develop
methods to transform a chain of convex gates into an equivalent convex gate and
to transform a simple parallel circuit comprised of convex gates into an equivalent
convex gate. These transformations can then be used in place of the transformations
of Section 4.2.2 to obtain an algorithm for the ConvexCGR (and hence also for
the CUGR) problem. We assume that a convex gate is given by a list of tuples
$\{(\delta_1, c_1), (\delta_2, c_2), \ldots, (\delta_m, c_m)\}$, $c_1 > c_2 > \cdots > c_m$. This list is called the DP
(delay-power) list of the gate.

Chain of Convex Gates: A chain of n convex gates with initial delays $d_1, d_2, \ldots, d_n$ is
replaced by a convex gate with initial delay $\sum_{i=1}^{n} d_i$ as in Figure 4.3(a). The DP
list for the new gate is obtained by merging the DP lists of the n gates in the
original chain into a single list sorted by non-increasing $c_i$s. During this process,
pairs with the same c value are combined into a single pair. For example if
(5,24) and (2,24) are pairs in $DP_1$ and $DP_2$, respectively, the combined pair
is (7,24). Suppose we have a 3 gate chain with $DP_1$ = {(3,28), (5,24), (3,21)},
$DP_2$ = {(2,24), (4,23)} and $DP_3$ = {(9,26)}. The DP list for the replacement
gate for this chain is {(3,28), (9,26), (7,24), (4,23), (3,21)}.
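
The series merge of DP lists can be sketched as follows. The function name `series_merge` and the list-of-tuples representation are assumptions of this sketch; for brevity it pools the pairs in a dictionary and sorts them, rather than performing the linear-time merge of already sorted lists.

```python
from collections import defaultdict

def series_merge(*dp_lists):
    """Merge the DP lists of gates in a chain.

    All (delta, c) pairs are pooled, the deltas of pairs sharing the same c
    are added together, and the result is ordered by decreasing c.
    """
    total = defaultdict(int)
    for dp in dp_lists:
        for delta, c in dp:
            total[c] += delta
    return [(total[c], c) for c in sorted(total, reverse=True)]

dp1 = [(3, 28), (5, 24), (3, 21)]
dp2 = [(2, 24), (4, 23)]
dp3 = [(9, 26)]
print(series_merge(dp1, dp2, dp3))
# [(3, 28), (9, 26), (7, 24), (4, 23), (3, 21)]
```

The output reproduces the DP list of the 3 gate chain example in the text, including the combined pair (7, 24).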

Simple Parallel Circuit with Convex Gates: First the chains in the circuit are
transformed into equivalent single convex gates. Then the delays of these equivalent
convex gates are increased to $\max_{1 \le j \le n}\{d_j\}$ (see Figure 4.3(b)). This increase
in delay provides us a power reduction $\Delta P$ and changes the tuples at the front
of the DP lists of the gates whose delay is increased. Let the new DP lists
be $DP'_1, DP'_2, \ldots, DP'_n$. Now the gates between s and t are replaced by a gate
with delay $\max_{1 \le j \le n}\{d_j\}$. The DP list of the gate may be obtained from the
$DP'_i$ lists. Figure 4.7 shows the process when n = 2. Here $L_1$ and $L_2$ denote
the two DP lists $DP'_1$ and $DP'_2$ and L denotes the DP list of the replacement
convex gate. Finally, s, t and the replacement gate are combined into a single
convex gate using the method for a chain of convex gates. Suppose we have
two parallel convex gates with delays 3 and 2 respectively and the corresponding
DP lists $DP_1$ = {(3,28), (5,24), (3,21)} and $DP_2$ = {(2,24), (4,23)}. The
delay of the equivalent convex gate is thus 3, which reduces the power
consumption of the second gate by 24. The DP list of the second gate is modified
to {(1,24), (4,23)}. By performing the Parallel-Merge operation we obtain the
DP list {(1,52), (2,51), (2,47), (3,24), (3,21)} for the equivalent gate.
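
A runnable sketch of the parallel merge on Python lists is given here for illustration. The function name `parallel_merge` and the use of index counters in place of linked-list pointers are assumptions of the sketch; it presumes both gates have already been brought to the same delay and that every delta is positive.

```python
def parallel_merge(l1, l2):
    """Combine the DP lists of two parallel gates that have equal delay."""
    out = []
    i = j = 0
    left1 = l1[0][0] if l1 else 0   # delay still unused in the current pair of l1
    left2 = l2[0][0] if l2 else 0   # same for l2
    while i < len(l1) and j < len(l2):
        step = min(left1, left2)    # both gates are slowed down together
        out.append((step, l1[i][1] + l2[j][1]))
        left1 -= step
        left2 -= step
        if left1 == 0:
            i += 1
            left1 = l1[i][0] if i < len(l1) else 0
        if left2 == 0:
            j += 1
            left2 = l2[j][0] if j < len(l2) else 0
    # One list is exhausted: copy what remains of the other one.
    if i < len(l1):
        out.append((left1, l1[i][1]))
        out.extend(l1[i + 1:])
    elif j < len(l2):
        out.append((left2, l2[j][1]))
        out.extend(l2[j + 1:])
    return out

print(parallel_merge([(3, 28), (5, 24), (3, 21)], [(1, 24), (4, 23)]))
# [(1, 52), (2, 51), (2, 47), (3, 24), (3, 21)]
```

The third pair comes out as (2, 47) because over that stretch the first gate contributes 24 and the second contributes 23 per unit of delay.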

4.2.4    Time  Complexity  of  Convex  GR  Problem 

A straightforward implementation of our algorithm of the previous section for the
ConvexCGR problem uses sorted chains (linked lists) to represent the DP lists. The
time needed to combine/merge the DP lists $L_1$ and $L_2$ of two gates is $O(|L_1| + |L_2|)$
regardless of whether we do a series or parallel merge. If we start with gates having
DP lists with k tuples each, then the time needed to transform an n gate series-parallel
circuit into its equivalent single gate is $O(kn^2)$. To see this, observe that each
series/parallel combine step reduces the number of gates by at least 1. Therefore,
there can be at most n - 1 combine steps. Further, after q combines, the size of a
DP list is O(kq) = O(kn). So the cost of O(n) combines is $O(kn^2)$. An example
circuit that exhibits $kn^2$ worst-case behavior is given in Figure 4.8.

Algorithm Parallel-Merge(L1, L2)
/* Merge L1 and L2; the two gates are in parallel */
p1 ← head(L1);
p2 ← head(L2);
L ← NULL;
while p1 ≠ NULL and p2 ≠ NULL do
    if δ(p1) < δ(p2) then
        Insert (δ(p1), c(p1) + c(p2)) into L;
        δ(p2) ← δ(p2) - δ(p1);
        p1 ← next(p1);
    else if δ(p1) > δ(p2) then
        /* Similar to the "if" part above, interchange p1 and p2. */
    else /* δ(p1) = δ(p2) */
        Insert (δ(p1), c(p1) + c(p2)) into L;
        p1 ← next(p1);
        p2 ← next(p2);
    end if
end while
if p1 = NULL then
    Append the remaining nodes of L2 starting from p2 to L;
else
    Append the remaining nodes of L1 starting from p1 to L;
end if
return L;

Figure 4.7. Algorithm Parallel-Merge


Figure  4.8.  Worst-case  merging  of  n  gates 

We can reduce the asymptotic time complexity to $O(kn \log^2 n)$ by using balanced
binary search trees (BBSTs) [15] to represent the DP lists. Each DP list is
represented by a BBST such that the external nodes represent the pairs $(\delta_i, c_i)$ in the DP
list in right to left order (i.e., in decreasing order of power reduction). Each internal
node x contains a triple of the form (D(x), C(x), M(x)), where D(x) is the sum of the
delays of the DP list pairs in the left subtree of x, C(x) is a corrective factor needed
to compute the $c_i$ values of pairs in the left subtree of x, and M(x) is a pointer to
the rightmost external node in the left subtree of x. Each external node y stores a
pair (d(y), c(y)) such that d(y) is the $\delta$ value of the DP list pair represented by node
y; the c value of this DP list pair is

$$c(y) + \sum_{\{x \,:\, y \text{ is in the left subtree of } x\}} C(x)$$

Figure 4.9 shows a possible BBST for the DP list {(3,28), (5,24), (3,21)}. The
leftmost external node contains the pair (3,13) which represents the DP list pair
(3,21). The correct c value for the DP list pair is obtained by adding to 13 the C
values in the ancestors of the external node.
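
The role of the C correctors can be illustrated with a small sketch. The classes `Leaf` and `Inner`, the helper names, and the particular tree shape and C values (8 at the root, 10 at its right child) are hypothetical; they are merely one choice that is consistent with the external-node pairs (3,13) and (5,14) mentioned for Figure 4.9. The D and M fields and all rebalancing are omitted.

```python
class Leaf:
    def __init__(self, d, c):
        self.d, self.c = d, c          # stored pair (d(y), c(y))

class Inner:
    def __init__(self, left, right, C):
        self.left, self.right, self.C = left, right, C

def pairs(node, corr=0):
    """Yield the true (delta, c) pairs from left to right."""
    if isinstance(node, Leaf):
        yield (node.d, node.c + corr)
    else:
        # C(x) applies only to external nodes in the left subtree of x.
        yield from pairs(node.left, corr + node.C)
        yield from pairs(node.right, corr)

def add_to_left(node, amount):
    """Raise the c value of every pair in node's left subtree in O(1)."""
    node.C += amount

# Hypothetical tree for the DP list {(3,28), (5,24), (3,21)}.
root = Inner(Leaf(3, 13), Inner(Leaf(5, 14), Leaf(3, 28), C=10), C=8)
print(list(pairs(root)))   # [(3, 21), (5, 24), (3, 28)]

add_to_left(root, 5)       # one corrector update changes a whole subtree
print(list(pairs(root)))   # [(3, 26), (5, 24), (3, 28)]
```

The point of the design is that a range of c values can be raised by touching a few internal nodes instead of every external node in the range.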


Figure 4.9. BBST used to represent DP list {(3,28), (5,24), (3,21)}. (External-node
labels recoverable from the original figure: (3,13), (5,14).)


To  insert  a  new  DP  list  pair  into  our  BBST,  we  must  be  able  to  trace  a  path 
from  the  root  to  an  appropriate  external  node.  This  path  tracing  is  facilitated  by  the 
pointer  M(x)  in  internal  node  x.  By  using  the  c()  value  in  the  external  node  M(x) 
and  the  C  values  in  the  nodes  from  the  root  to  x,  we  can  compute  the  maximum  c 
value  of  any  DP  pair  in  the  left  subtree  of  x. 
Since insertions require rotations, we show how the D() and C() values in internal
nodes are to be changed when rebalancing rotations are done (Figure 4.10). Note
that M() values remain unchanged during tree rotations, and are omitted from the
figure. The tuple (D(x), C(x)) of each internal node x is shown next to the node.

To merge the DP lists of two gates in a chain, we first perform an inorder traversal
of the smaller DP list's BBST to extract the DP list's pairs. Then, these pairs are
inserted into the BBST for the larger DP list. During this insertion, pairs with the
same c value are combined into a single pair. If the two DP lists are $L_1$ and $L_2$, the
time needed to do the series merge is $O(|L_1| \log(|L_1| + |L_2|))$, where $L_1$ is the smaller
DP list.

For a parallel merge of two DP lists $L_1$ and $L_2$ ($L_1$ is the smaller list), we need
to identify for each $(\delta_k, c_k)$ in $L_1$, the external nodes z in the BBST for $L_2$ for which
$\sum_{i=1}^{k-1} \delta_i < f(z) < \sum_{i=1}^{k} \delta_i$, where the $\delta_i$s are defined with respect to $L_1$ and f(z) is
the sum of the d() values of the external nodes in the BBST of $L_2$ that lie to the left
of z plus the d() value of z. Let x and y be the leftmost and rightmost such external
nodes (see Figure 4.11). These nodes can be found in $O(\log |L_2|)$ time using $\sum_{i=1}^{k-1} \delta_i$
and $\sum_{i=1}^{k} \delta_i$ and the D values in the triples of the internal nodes of the BBST for $L_2$.
Actually, node x may already be known from the processing of the pair $(\delta_{k-1}, c_{k-1})$
of $L_1$. We need to increase the c values of the external nodes from x to y. This can
be done in logarithmic time by changing the C correctors stored in the internal nodes
on the paths from x and y to their common ancestor (see Figure 4.11).

Figure 4.10. Update of D()s and C()s for internal nodes during tree rotations

Figure 4.11. Change of c values of internal nodes of $L_2$ for the kth tuple of $L_1$

In addition to the above change in C correctors, we may need to insert a new
external node. If $f(y) = \sum_{i=1}^{k} \delta_i$, then no insertion is needed; we increase c(y) by $c_k$.
Otherwise, we change the original node to $(\sum_{i=1}^{k} \delta_i - f(y) - d(y),\; c(y) + c_k)$ and
insert $(f(y) - \sum_{i=1}^{k} \delta_i,\; c_k + c(y))$ into the BBST. The inserted external node is the x
node for the next pair of $L_1$.

When the BBST method is used on the worst-case example for the linked list
method (Figure 4.8), $|L_1| = k$ and $|L_2| \le kn$ for each of the n - 1 merges. Therefore,
the run time is $O(kn \log kn)$. The worst case for the BBST method is when we
continually merge DP lists of the same size. The worst case is described by $\log n$
stages of merges, each stage involving pairs of DP lists of the same size. In stage
1, $n/2$ pairs of lists each of size k are merged in $O(\frac{n}{2} k \log 2k)$ time to produce $n/2$ DP
lists of size 2k each; in stage 2, $n/4$ pairs of DP lists of size 2k each are merged in
$O(\frac{n}{4}\, 2k \log 4k)$ time to produce $n/4$ DP lists of size 4k each; and so on. The total time
is $O(\sum_{i=1}^{\log n} \frac{n}{2} k \log kn) = O(kn \log kn \log n)$.

Figure 4.12 shows a circuit on which this worst-case bound is achieved. Let $C_0$
be a circuit with a single module. $C_i$, $i > 0$, is a simple parallel circuit obtained
from $C_{i-1}$ as shown in Figure 4.12. The number of modules in $C_i$ is $2^{i+1} + i(i - 1)$
and $C_i$ requires $2^{i-1}$ parallel merges and $2^i - 1$ series merges. The total cost is
$O(kn \log kn \log n)$.

Figure 4.12. Circuit $C_2$ that exhibits the worst-case behavior


4.3    Tree  Circuits 


Gates in circuits with a tree topology (for example, distribution trees) can be
resized by transforming the trees into equivalent single gate circuits using the basic
transformation shown in Figure 4.13. The transformation of Figure 4.13 first
transforms a node, all of whose children are leaves, into an equivalent simple parallel
circuit by the introduction of additional gates/nodes with delay $r - r_i$, where $r_i$ is the
required time for the output of leaf i and $r = \max_i(r_i)$. The c values for the new gates
are 0. The simple parallel circuit can now be transformed into an equivalent single
gate using the transformation of Figure 4.3. By repeatedly applying this transformation
on any tree, the tree can be transformed into an equivalent single gate.
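
The padding step of this transformation can be sketched as follows. The function name `leaves_to_parallel` and the example required times are hypothetical; the sketch only computes the common required time r and the (delay, c) pair of the gate added in front of each leaf.

```python
def leaves_to_parallel(required_times):
    """Pad each leaf so that all leaves share the required time r = max(r_i).

    Returns r and, for each leaf i, the added gate as a (delay, c) pair with
    delay r - r_i and c = 0 (slowing a padding gate saves no power).
    """
    r = max(required_times)
    return r, [(r - r_i, 0) for r_i in required_times]

r, padding = leaves_to_parallel([7, 10, 8])
print(r, padding)   # 10 [(3, 0), (0, 0), (2, 0)]
```

After padding, the leaves behave like the branches of a simple parallel circuit with a single required time, so the transformation of Figure 4.3 applies.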

Although  the  preceding  transformation  was  described  specifically  for  the  CGR 
problem,  it  is  easily  extended  to  the  CUGR  and  ConvexCGR  problems  using  the 
ideas  of  Section  4.2. 

Figure 4.13. Transformation of a basic tree to a simple parallel circuit

4.4    CGR with Multigate Modules Is NP-Hard

Suppose that a circuit is to be realized with modules that contain multiple gates.
Increasing the delay of a module results in an increase in the delay of all gates in the
module. Figure 4.14 shows a module v with two gates A and B, each of which is a
two-input one-output gate. The delay of the selected module implementation is d(v) and each
unit increase in module delay reduces power consumption by c(v) and increases the
delay of both A and B by one unit. We shall show that the CGR problem with
multigate modules (MCGR) is NP-hard. For the proof, we show that if MCGR can
be solved in polynomial time, then the one-in-three 3SAT problem [9] can also be
solved in polynomial time.


Figure 4.14. A module v with two gates A and B, labeled (d(v), c(v))


Definition  1  (one-in-three  3SAT) 
Input: Collection of clauses $C_1, C_2, \ldots, C_m$ over variables $x_1, x_2, \ldots, x_n$ such that
each clause is the disjunction of exactly three literals.

Output: "yes" if and only if there is a truth assignment to the variables such that
each clause has exactly one true literal.
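
For small instances, Definition 1 can be checked by brute force. The sketch below is only meant to make the definition concrete; the function name `one_in_three_sat` and the encoding of literals as signed integers (+i for $x_i$, -i for its complement) are assumptions of this sketch, and the running time is exponential in n.

```python
from itertools import product

def one_in_three_sat(clauses, n):
    """Return True iff some assignment makes exactly one literal true per clause.

    clauses: list of 3-tuples of nonzero ints; +i denotes x_i, -i its complement.
    n: number of variables.
    """
    for bits in product([False, True], repeat=n):
        def value(lit):
            v = bits[abs(lit) - 1]
            return v if lit > 0 else not v
        if all(sum(value(l) for l in clause) == 1 for clause in clauses):
            return True
    return False

print(one_in_three_sat([(1, -2, 3), (-1, 2, 3)], 3))  # True
print(one_in_three_sat([(1, 2, 3), (-1, 2, 3)], 3))   # False
```

In the first instance, setting $x_1$ and $x_2$ true and $x_3$ false leaves exactly one true literal in each clause. In the second, no assignment works: if $x_1$ is true the second clause has no true literal, and if $x_1$ is false the second clause has two.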

Theorem  1  MCGR  is  NP-hard. 

Proof. We show how to transform, in polynomial time, any instance I of the
one-in-three 3SAT problem into an instance I' of the MCGR problem such that the maximum
power reduction for I' is (2m + 1)n + m if and only if the answer to I is "yes". Here
m is the number of clauses in I and n is the number of variables.

For the transformation, we define two circuit subassemblies: the variable subassembly
and the clause subassembly. A variable subassembly consists of two multigate modules,
each having two gates and connected as in Figure 4.15.

Figure 4.15. Variable subassembly for variable $x_i$

The first module of the variable subassembly for variable $x_i$ is called module $x_i$
and the second is module $\bar{x}_i$. The inputs to gates A and B of module $x_i$ and to gate B
of $\bar{x}_i$ are primary inputs which are available at time 0. One of the inputs to gate A of
module $\bar{x}_i$ is a primary input available at 0 and the other input is the output of gate
A of module $x_i$. The output of gate A of module $\bar{x}_i$ is a primary output which has a
required arrival time of 1. The outputs of the two B gates are non-primary outputs.
The c value for each module is 2m + 1 and the initial delay of each is 0. Notice that
we can increase the delay of either module $x_i$ or $\bar{x}_i$ (but not both) by 1 and still
satisfy the arrival time requirements of the primary output. Therefore, the maximum
power reduction obtainable from a variable subassembly is 2m + 1. If module $x_i$ has
delay 1, then we say that literal $x_i$ is true; otherwise $x_i$ is false. Similarly, if module
$\bar{x}_i$ has delay 1, the literal $\bar{x}_i$ is true; otherwise $\bar{x}_i$ is false. Although we can assign
delays to the two modules so that both literals are false, delay assignments can make
at most one literal true.

We construct one variable subassembly for each of the n variables in the 3SAT
instance I. The maximum power reduction obtainable from these n subassemblies is
(2m + 1)n.

A clause subassembly consists of 3 modules with one gate each; each gate has 3
inputs and 1 output. Let $l_1$, $l_2$ and $l_3$ be the three literals in a clause. Figure 4.16
shows the corresponding clause subassembly together with the inputs to each gate.
These inputs are the outputs of the variable subassemblies. The outputs of the
modules of the clause subassemblies are primary outputs with required time of 1.
The c value for each module in a clause subassembly is 1.

Figure 4.16. Clause subassembly for $(l_1 \vee l_2 \vee l_3)$

The maximum power reduction obtainable from a clause subassembly is 3. This
corresponds to the case when all 6 literals ($\bar{l}_1$, $\bar{l}_2$, $\bar{l}_3$, $l_1$, $l_2$ and $l_3$) are false. We shall
have one clause subassembly for each of the m clauses in I. Therefore the maximum
power reduction available from all m clause subassemblies is 3m.

The circuit I' is comprised of the n variable and m clause subassemblies described
above.

If there is a truth assignment T for the variables of I such that the answer to the
one-in-three 3SAT instance I is "yes", then make the delay of module $x_i$ 1 if $x_i$ is
true in T, otherwise make the delay of $\bar{x}_i$ 1. Further, since exactly one literal is true
in each clause of I, we can make the delay of exactly 1 module in each clause 1. The
total power reduction is (2m + 1)n + m.

Now suppose there is a solution S to I' which gives us a power reduction $\ge$
(2m + 1)n + m. In S, each variable subassembly must have exactly one module with
delay 1. To see this, observe that no variable subassembly can have two modules
with delay 1 and if any variable subassembly has no module with delay 1, the power
reduction obtained by S is at most (2m + 1)(n - 1) + 3m < (2m + 1)n + m. So,
assume that each variable subassembly has exactly one module with delay 1. This
means that we have a consistent truth assignment; that is, there is no variable $x_i$ for
which either both $x_i$ and $\bar{x}_i$ are true or both are false.

Now let's determine the power reduction obtainable from the clause subassemblies.
If $l_1$ is the only literal that is true among $l_1$, $l_2$ and $l_3$, we can make the delay of the
topmost module 1 because the arrival times of $o(\bar{l}_1)$, $o(l_2)$ and $o(l_3)$ are all 0. The
delays for the remaining two modules must be 0 because the arrival time of $o(l_1)$ is 1.
A similar analysis can be done for the cases when $l_2$ and $l_3$ are the only true literals.
We conclude that when exactly one literal of a clause is true, we can get a power
reduction of at most 1 from its clause subassembly.

Therefore we can get at most m units of power reduction from the m clause
subassemblies. The maximum of m is obtained only when exactly one literal of each
clause is true. Hence the solution to the MCGR instance I' provides a power
reduction $\ge$ (2m + 1)n + m if and only if the one-in-three 3SAT instance has answer
"yes". □

4.5     General Circuits

4.5.1     The CGR Algorithm of Chen and Sarrafzadeh

Chen  and  Sarrafzadeh  [3]  have  proposed  a  pseudo  polynomial  time  algorithm  for 
the  CGR  problem.  Let  a(v)  denote  the  arrival  time  of  the  signal  at  the  output  of 
gate  v  and  let  r(v)  denote  the  required  time  for  the  signal  at  the  output  of  gate  v. 
For  a  primary  input,  a(v)  is  the  time  at  which  the  signal  becomes  available  and  for 
a  primary  output  r(v)  is  the  required  time  for  that  output.  Hence  a(v)  is  known  for 
primary  inputs  and  r(v)  is  known  for  primary  outputs.  The  remaining  a's  and  r's 
are  defined  as  below  (d(v)  is  the  assigned  delay  of  gate  v): 


\[ a(v) = \max_{\{u : (u,v) \in E\}} \bigl( a(u) + d(v) \bigr) \tag{4.1} \]

\[ r(v) = \min_{\{w : (v,w) \in E\}} \bigl( r(w) - d(w) \bigr) \tag{4.2} \]


Hence a(v) is the length of the longest delay path from the primary inputs to the
output of v, and r(v) is the latest time by which the signal must arrive at the output
of gate v so that it is still possible for the signal to reach the primary outputs by
their required times. Note that a(v) and r(v) are very closely related to the early
and late event times for activity networks [5]. The slack, s(v), of gate v is

\[ s(v) = r(v) - a(v) \tag{4.3} \]

A circuit is feasible if and only if all primary outputs arrive by their required time.
From the definitions of r(v) and a(v), it follows that a circuit is feasible if and only
if s(v) ≥ 0 for all v.
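
Equations 4.1 to 4.3 amount to one forward and one backward pass over the circuit graph. The Python sketch below is illustrative only: the function name `timing`, its argument layout, the requirement that `gates` be listed in topological order, and the small three-gate circuit at the end are assumptions made for this example (the circuit is hypothetical and is not the one of Figure 4.17).

```python
from collections import defaultdict


def timing(gates, edges, delay, pi_arrival, po_required):
    """Return (a, r, s) for a combinational circuit.

    gates       -- gate names in topological order (assumed by this sketch)
    edges       -- (u, v) pairs: the output of gate u feeds gate v
    delay       -- delay[v] is the assigned delay d(v)
    pi_arrival  -- pi_arrival[v]: arrival times of primary inputs feeding v
    po_required -- po_required[v]: required times of primary outputs driven by v
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    a = {}
    for v in gates:  # forward pass, Equation 4.1
        candidates = [a[u] for u in preds[v]] + list(pi_arrival.get(v, []))
        a[v] = max(candidates) + delay[v]

    r = {}
    for v in reversed(gates):  # backward pass, Equation 4.2
        limits = [r[w] - delay[w] for w in succs[v]]
        limits += list(po_required.get(v, []))
        r[v] = min(limits)

    s = {v: r[v] - a[v] for v in gates}  # slack, Equation 4.3
    return a, r, s


# Hypothetical circuit: gate a drives gates b and c; all delays are 1.
gates = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c")]
delay = {"a": 1, "b": 1, "c": 1}
a, r, s = timing(gates, edges, delay,
                 pi_arrival={"a": [0, 2]},
                 po_required={"b": [5], "c": [4]})
feasible = all(slack >= 0 for slack in s.values())
```

In this hypothetical circuit only gate b ends up with positive slack, so it is the only gate whose delay could be increased without violating a required time.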

Figure  4.17(a)  shows  an  example  circuit  graph.  This  corresponds  to  a  circuit 
with  three  gates,  2  primary  inputs  and  2  primary  outputs.  The  a()  values  for  the  two 
primary  inputs  are  0  and  2  respectively.  The  r()  values  for  the  two  primary  outputs 
are  5  and  4  respectively.  The  3  gates  a,  b  and  c  are  shown  by  boxes.  The  c()  value 
of  a  gate  is  given  below  the  box.  The  selected  delay  for  each  gate  is  1  and  is  shown 
inside the box. The a(v)\s(v)\r(v) values for each gate are given above the box.


Figure 4.17. Application of the algorithm of Chen and Sarrafzadeh [3] to a CGR circuit.
(a) An example CGR circuit; (b) sensitive graph


The  algorithm  of  Chen  and  Sarrafzadeh  comprises  the  following  steps: 


Step  1:  Compute  the  slack  for  each  node  of  G. 


Step 2: If no node has slack > 0, stop.


Step 3: Compute the sensitive graph Gs from G as follows. Gs contains exactly those
vertices of G that have slack > 0. (u, v) is an edge of Gs if and only if either
a(v) - a(u) = d(v) or r(v) - r(u) = d(v). The weight of a vertex in Gs is its
c() value.

Step  4:  Compute  the  transitive  closure  graph  Gt  of  Gs. 

Step  5:  Compute  a  maximum  weighted  independent  set  of  Gt. 

Step  6:  Increase  the  delays  of  all  gates  in  the  maximum  weighted  independent  set  by 
1. 

Step  7:  Go  to  Step  1. 

The maximum weighted independent set of Gt may be computed using a maxflow
algorithm [23]. This takes O(nm log(n^2/m)) time for a graph with n vertices and
m edges. Since the algorithm of Chen and Sarrafzadeh [3] may reduce the power
consumption by only one unit on each iteration, its complexity is O(S nm log(n^2/m)),
where S is the obtained power reduction.
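
Steps 3 to 5 of one iteration can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [3]: the function names are hypothetical, the arrival and required times are taken as given, and the maximum weighted independent set is found by exhaustive search (exponential time, suitable only for tiny graphs) instead of the maxflow method cited above. The three-gate data at the end are invented for the example.

```python
from itertools import combinations


def sensitive_graph(edges, delay, a, r, cost):
    """Step 3: keep positive-slack vertices and the edges that are tight."""
    weights = {v: cost[v] for v in a if r[v] - a[v] > 0}
    gs_edges = [(u, v) for u, v in edges
                if u in weights and v in weights
                and (a[v] - a[u] == delay[v] or r[v] - r[u] == delay[v])]
    return weights, gs_edges


def transitive_closure(vertices, edges):
    """Step 4: reach[u] is the set of vertices reachable from u."""
    reach = {v: set() for v in vertices}
    for u, v in edges:
        reach[u].add(v)
    changed = True
    while changed:
        changed = False
        for u in vertices:
            new = set()
            for v in reach[u]:
                new |= reach[v]
            if not new <= reach[u]:
                reach[u] |= new
                changed = True
    return reach


def max_weight_independent_set(weights, reach):
    """Step 5 by brute force: no chosen vertex may reach another one."""
    best, best_weight = set(), 0
    names = sorted(weights)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if any(v in reach[u] or u in reach[v]
                   for u, v in combinations(subset, 2)):
                continue
            total = sum(weights[v] for v in subset)
            if total > best_weight:
                best, best_weight = set(subset), total
    return best, best_weight


# Hypothetical data: gate x drives gates y and z; every delay is 1.
a = {"x": 1, "y": 2, "z": 2}
r = {"x": 2, "y": 3, "z": 4}
delay = {"x": 1, "y": 1, "z": 1}
cost = {"x": 15, "y": 14, "z": 13}
weights, gs_edges = sensitive_graph([("x", "y"), ("x", "z")], delay, a, r, cost)
reach = transitive_closure(weights, gs_edges)
chosen, total = max_weight_independent_set(weights, reach)
```

Here raising the delays of y and z together is worth 14 + 13 = 27, which beats raising x alone (15), so the independent set {y, z} is selected.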

Figure 4.17(b) shows the graph Gs that corresponds to the graph G of
Figure 4.17(a). The numbers inside the vertices are their weights. The transitive closure
graph Gt for Gs is the same as Gs, and the maximum weighted independent set is
{b,c} with a weight of 14 + 13 = 27. The delays of gates b and c are increased by 1
to obtain a power reduction of 27 and we proceed to the second iteration.

4.5.2     Comments on the Algorithm of Chen and Sarrafzadeh

1.  Although  we  expect  the  algorithm  of  Chen  and  Sarrafzadeh  [3]  to  be  quite 
efficient  on  circuits  for  which  only  a  small  power  reduction  is  possible  (i.e.,  S  is 
small),  it  is  not  expected  to  be  efficient  on  circuits  whose  power  consumption 
can  be  significantly  reduced.  For  example,  consider  the  one-gate  circuit  of 
Figure  4.18.  The  arrival  time  of  the  primary  input  is  0,  the  required  time  of 
the  primary  output  is  r,  and  the  initial  gate  delay  is  0.  The  algorithm  [3]  takes 
r  iterations  to  complete.  We  would  like  an  algorithm  that  can  increase  gate 
delay  by  more  than  1  on  each  iteration.  In  particular,  it  should  be  possible  to 
obtain  the  optimal  solution  for  the  circuit  of  Figure  4.18  in  one  iteration. 


Figure 4.18. A simple example CGR circuit

2. Chen and Sarrafzadeh [3] have proved that their algorithm indeed solves the
CGR problem optimally. In most realistic situations, however, one or more
of the gates will have an upper limit on the obtainable delay. That is, the
problem will really be a CUGR problem. The algorithm [3] does not obtain
optimal solutions to the CUGR problem. For example, consider the circuit
of Figure 4.19(a). The numbers above a gate give the upper bound on the
gate's delay. This is essentially the circuit of Figure 4.17(a) with the addition
of upper bounds on gate delay. The first iteration of the algorithm of Chen
and Sarrafzadeh [3] proceeds exactly as it did without the upper bounds and
we arrive at the configuration of Figure 4.19(b). For the second iteration, gates
b and c are eliminated from Gs because their delays cannot be increased any
further. Gs is now just a single vertex graph. Gt = Gs and the maximum
weighted independent set is {a}. The delay of gate a is increased by 1 and
the algorithm terminates (see Figure 4.19(c)). The power reduction obtained
is 13 + 14 + 15 = 42. However, the optimal power reduction of 44 is obtained by
changing the delay of gate a to 3 and gate b to 2, and leaving the delay of gate
c at 1 (see Figure 4.19(d)).

3.  As  noted  in  Section  4.2,  gates  with  an  upper  bound  on  their  delay  may  be 
modeled  by  gates  with  convex  delay-power  consumption  functions.  Since  the 
algorithm  of  Chen  and  Sarrafzadeh  [3]  does  not  obtain  optimal  solutions  for 
the  CUGR  problem,  it  does  not  obtain  optimal  solutions  for  the  ConvexCGR 
problem. 

4.5.3     A Unified Framework for CGR, CUGR and ConvexCGR

The CGR, CUGR and ConvexCGR problems can all be solved in pseudopolynomial
time by transforming the circuit into an activity-on-edge PERT (Program
Evaluation and Review Technique) network and then using the algorithm of Fulkerson
[8] for project cost curves.

Figure 4.19. Application of the algorithm of Chen and Sarrafzadeh [3] to a CUGR circuit.
(a) An example CUGR circuit; (b) delay of each gate after first iteration; (c) delay
of each gate after algorithm [3] terminates; (d) delay of each gate for optimal power
reduction

The PERT network G for any circuit C is obtained as follows:

Step 1: For each gate v of C, G contains two vertices, v^- and v^+. There is an edge
(v^-, v^+) from vertex v^- to v^+. With this edge (v^-, v^+), we associate a triple
(a(v^-, v^+), b(v^-, v^+), c(v^-, v^+)), where a(v^-, v^+) = d_i(v); b(v^-, v^+) = d_i(v) +
s(v) for the CGR problem and b(v^-, v^+) = min{d_i(v) + s(v), d_u(v)} for the
CUGR problem, where d_u(v) is the upper bound on the delay for gate v
(the ConvexCGR case is discussed later); and c(v^-, v^+) = c(v).

Step 2: For each edge (u, v) in C, there is an edge (u^+, v^-) in G. The triple for this
edge is (0,0,0).

Step 3: G has two special vertices s (source) and t (sink). There is an edge (s, v^-)
for every gate v for which C has a primary input. The triple for this edge is
(max{a(v)}, max{a(v)}, 0), where the maximum is taken over the arrival times
of all primary inputs to v. Additionally, there is an edge (v^+, t) for all gates v
that have a primary output. The triple for this edge is (a, a, 0), where a = max{
required times of all primary outputs} - (required time of primary output of
gate v).
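
The three construction steps can be sketched in Python as below. This is an illustrative sketch, not the authors' C implementation: the function name `build_pert`, the encoding of the two vertices of gate v as the tuples `(v, "-")` and `(v, "+")`, the use of the initially assigned delay as the lower end of each gate triple, the restriction to one primary output per gate, and the example circuit are all assumptions made here.

```python
def build_pert(gates, edges, d_init, slack, cost,
               pi_arrival, po_required, d_upper=None):
    """Map each PERT edge (x, y) to its triple (a, b, c).

    Gate v contributes the vertices (v, "-") and (v, "+"); "s" and "t"
    are the source and sink. po_required[v] is a single required time,
    which is an assumption of this sketch.
    """
    net = {}
    for v in gates:                          # Step 1: one edge per gate
        b = d_init[v] + slack[v]
        if d_upper is not None and v in d_upper:
            b = min(b, d_upper[v])           # CUGR: respect the upper bound
        net[((v, "-"), (v, "+"))] = (d_init[v], b, cost[v])
    for u, v in edges:                       # Step 2: connections add no delay
        net[((u, "+"), (v, "-"))] = (0, 0, 0)
    for v, arrivals in pi_arrival.items():   # Step 3: source edges
        latest = max(arrivals)
        net[("s", (v, "-"))] = (latest, latest, 0)
    t_max = max(po_required.values())
    for v, required in po_required.items():  # Step 3: sink edges
        gap = t_max - required
        net[((v, "+"), "t")] = (gap, gap, 0)
    return net


# Hypothetical circuit: gate a drives gates b and c; all initial delays are 1.
common = dict(
    gates=["a", "b", "c"],
    edges=[("a", "b"), ("a", "c")],
    d_init={"a": 1, "b": 1, "c": 1},
    slack={"a": 0, "b": 1, "c": 0},
    cost={"a": 15, "b": 14, "c": 13},
    pi_arrival={"a": [0, 2]},
    po_required={"b": 5, "c": 4},
)
cgr_net = build_pert(**common)
cugr_net = build_pert(d_upper={"b": 1}, **common)
```

The only difference between the CGR and CUGR networks is the middle entry of the gate triples, which is why the same solver can be applied to both.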

The PERT network for the circuit of Figure 4.17(a) is shown in Figure 4.20. Pairs
of vertices (v^-, v^+) are enclosed in broken boxes. Edge triples are shown above each
edge. The number inside each vertex is its initial τ() value, which will be defined later.
The interpretation of an edge triple (a, b, c) is: a is the smallest delay through the
edge, b is the maximum delay through the edge, and c is the power reduction per
unit increase in delay in the range a through b.

Figure 4.20. A PERT network for the CGR circuit of Figure 4.17(a).

The objective is to assign integer values τ() to the vertices of the PERT network,
and integer weights w() to the edges so as to

\[ \text{maximize} \quad \sum_{(x,y) \in E} c(x,y)\, w(x,y) \tag{4.4} \]

subject to

\[ w(x,y) \le \tau(y) - \tau(x) \tag{4.5} \]

\[ a(x,y) \le w(x,y) \le b(x,y) \tag{4.6} \]

\[ \tau(s) = 0 \tag{4.7} \]

\[ \tau(t) \le \max\{\text{required time of primary outputs}\} \tag{4.8} \]

It is easy to see that the optimal solution to the above integer linear program
defines an optimal solution for the power reduction problem. In this solution, the delay
of gate v is τ(v^+) - τ(v^-) and the obtained power reduction is \sum_{(x,y) \in E} c(x,y) w(x,y).

The algorithm of Fulkerson [8] solves the above linear program using a primal-
dual approach and a network flow algorithm. It begins by setting w(x, y) = b(x, y)
for each edge and computes the smallest τ() that satisfy Equations 4.5-4.7 using a
topological order scan of the PERT network beginning at vertex s. These values
become the initial τ() values. In Figure 4.20, the number in each vertex indicates
this initial τ value. If the computed τ(t) satisfies Equation 4.8, we are done. If not,
the w's are reduced using augmenting path methods until we have an assignment
of w's and τ's that satisfy all the constraints (Equations 4.5-4.8). Fulkerson [8] has
extended his algorithm to the case when c(x, y) is given by a convex function for each
edge. This extension essentially increases the number of edges in G. Therefore we
are able to also solve the ConvexCGR problem with this formulation.
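
The initialization step described above (set every w to b, then scan in topological order) is a longest-path computation from s. The Python sketch below covers only that first step, not the augmenting-path phase of Fulkerson's algorithm. The function name `initial_tau`, the dictionary encoding of the network, and the assumptions that s is the only vertex without incoming edges and that all b values are nonnegative are choices made for this example; the one-gate network mirrors the circuit of Figure 4.18 with the required time r set to a hypothetical value of 7.

```python
from collections import defaultdict, deque


def initial_tau(net):
    """Smallest tau with tau(s) = 0 and tau(y) - tau(x) >= b(x, y).

    net maps each edge (x, y) to its triple (a, b, c). The source is
    assumed to be the only vertex without incoming edges.
    """
    out = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for (x, y), (_low, high, _cost) in net.items():
        out[x].append((y, high))          # w(x, y) starts at b(x, y)
        indegree[y] += 1
        nodes.update((x, y))

    tau = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indegree[v] == 0)
    while queue:                          # topological order scan
        x = queue.popleft()
        for y, w in out[x]:
            tau[y] = max(tau[y], tau[x] + w)
            indegree[y] -= 1
            if indegree[y] == 0:
                queue.append(y)
    return tau


# One-gate circuit: arrival time 0, required time 7, initial delay 0,
# so the slack is 7 and the gate edge has the triple (0, 7, c) with c = 5.
required = 7
net = {
    ("s", "n-"): (0, 0, 0),
    ("n-", "n+"): (0, 7, 5),
    ("n+", "t"): (0, 0, 0),
}
tau = initial_tau(net)
done_after_first_scan = tau["t"] <= required
```

For this network the initial scan already satisfies Equation 4.8, so no reduction of the w values is needed and the gate delay is raised from 0 to the full required time in a single pass.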

Although the asymptotic complexity of Fulkerson's method [8] is the same as that
of the algorithm of Chen and Sarrafzadeh [3], we expect Fulkerson's algorithm to be
faster for the following reasons:

1. Fulkerson's algorithm can reduce the delay of gates by more than 1 on each
iteration. For example, the circuit of Figure 4.18 is handled with just one
iteration.

2. Successive iterations of Fulkerson's algorithm use the results of preceding
iterations; each iteration requires the computation of new augmenting paths.
Successive iterations of the algorithm of Chen and Sarrafzadeh [3] essentially
start from scratch, recomputing Gs, Gt and the maximum weighted
independent set.

4.6    The  General  Gate  Resizing  Problem  (GGR) 

Chen and Sarrafzadeh [3] have proposed a heuristic for the general gate resizing
problem. We show how our methodology of the previous section can be extended
to obtain a heuristic for the GGR problem with convex (delay, power consumption)
pairs. Let the (delay, power consumption) pairs for a gate be (d_1, p_1), (d_2, p_2), ..., (d_k, p_k)
with d_1 < d_2 < ... < d_k and p_1 > p_2 > ... > p_k. (d_1, p_1) is the pair for the initially
selected gate size. The pairs are convex if and only if c_1 > c_2 > ... > c_{k-1}, where
c_i = (p_i - p_{i+1})/(d_{i+1} - d_i). In practice, we expect most GGR instances to be convex.
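
The convexity condition is easy to check mechanically. The Python sketch below computes the slopes c_i and tests that they are strictly decreasing, following the strict inequalities as they appear in the text (a nonstrict variant would replace `>` with `>=`). The function names and the sample pairs are hypothetical, and integer delays and powers are assumed so that exact fractions can be used.

```python
from fractions import Fraction


def slopes(pairs):
    """c_i = (p_i - p_{i+1}) / (d_{i+1} - d_i) for consecutive pairs."""
    return [Fraction(p1 - p2, d2 - d1)
            for (d1, p1), (d2, p2) in zip(pairs, pairs[1:])]


def is_convex(pairs):
    """True when c_1 > c_2 > ... > c_{k-1}."""
    c = slopes(pairs)
    return all(left > right for left, right in zip(c, c[1:]))


# Hypothetical (delay, power consumption) pairs, sorted by increasing delay.
convex_pairs = [(1, 10), (2, 6), (4, 2), (8, 0)]
nonconvex_pairs = [(1, 10), (2, 9), (3, 5)]
```

For `convex_pairs` the slopes are 4, 2 and 1/2, so each additional unit of delay saves less power than the previous one; for `nonconvex_pairs` the slopes are 1 and 4, which violates the condition.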
To solve GGR with convex pairs, we construct a PERT network as before. However,
each edge (v^-, v^+) is now a chain as shown in Figure 4.21, where δ_i = d_{i+1} - d_i.

[Figure: chain of vertices with edge triples (d_1, δ_1, c_1), (0, δ_2, c_2), (0, δ_3, c_3), ..., (0, δ_{k-1}, c_{k-1})]

Figure 4.21. Transformation of vertex v into a chain in PERT network

Once the network has been solved using the algorithm of Fulkerson [8], the
flows in the chains are adjusted to obtain a feasible solution to the GGR problem.
Since c_1 > c_2 > ... > c_{k-1}, we may assume that, in the optimal solution, the delay
for edge (v_i, v_{i+1}) is made δ_i before the delay of edge (v_{i+1}, v_{i+2}) is increased above
0. For each chain, find the rightmost edge (v_i, v_{i+1}) whose delay exceeds 0. If there
is no such edge, select edge (v_1, v_2). If the delay of the edge is less than δ_i, set the
delay of the gate v to d_i; otherwise set it to d_{i+1}.
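
The rounding rule for one chain can be sketched in Python as follows. The function name `round_chain` and the input encoding are assumptions of this example: `d` holds the available delays d_1 < ... < d_k of the gate, and `extra[i]` is the delay the network solution placed on the i-th chain edge, interpreted here as the amount added on top of the previous breakpoint (a value between 0 and δ_i).

```python
def round_chain(d, extra):
    """Pick a library delay for a gate from the delays on its chain edges."""
    deltas = [right - left for left, right in zip(d, d[1:])]
    chosen = 0                      # default: first edge (v_1, v_2)
    for index, value in enumerate(extra):
        if value > 0:
            chosen = index          # rightmost edge with positive delay
    if extra[chosen] < deltas[chosen]:
        return d[chosen]            # edge only partly used: round down
    return d[chosen + 1]            # edge fully used: next breakpoint


# Hypothetical gate with four implementations.
d = [1, 2, 4, 8]
results = [
    round_chain(d, [0, 0, 0]),      # nothing increased
    round_chain(d, [1, 1, 0]),      # second edge partly used
    round_chain(d, [1, 2, 0]),      # second edge fully used
    round_chain(d, [1, 2, 4]),      # every edge fully used
]
```

Because a partly used edge is rounded down to its lower breakpoint, the selected delay never exceeds the delay assigned by the network solution, which keeps the timing constraints satisfied.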

4.7    Experimental  Results 

We implemented our CGR and GGR algorithms as well as those of Chen and
Sarrafzadeh [3] in C and benchmarked them on a SUN SPARCstation 5. All
algorithms were coded using similar programming methodologies so that any observed
performance differences can be attributed to algorithmic differences rather than to
differences in programming style.

The test circuits we used include combinational circuits from the MCNC-91
benchmark suite. The library we used includes NAND, NOR, and INVERTER gates. Each
gate has a minimum delay of one clock cycle. Technology mapping was done using
Berkeley SIS and power consumption was calculated using a 5V supply voltage and
20MHz clock frequency. The switching activity factors of individual gates were
calculated using the symbolic simulation technique described in Ghosh et al. [10], which
is implemented in Berkeley SIS.

Tables 4.1 and 4.2 give our experimental results for the CGR algorithms. The
number of gates in each circuit is given by n; t_a is the length of the critical path in the
circuit (i.e., the length of the longest path from a primary input to a primary output when
gate delays equal the initially assigned delays). The run times are given in seconds.
In the experiments reported in Table 4.1, the required time for the output signal was
set equal to the critical path length, t_a, of the circuit.

Since both the CGR algorithms produce provably optimal solutions, the only
difference between them is run time. On 6 of the 9 tested circuits, our algorithm
is noticeably faster; on another 2, the two algorithms took about the same time;
and on only 1 of the 9 circuits did the algorithm [3] outperform our algorithm. The
disparity between the two algorithms becomes more striking when the required time
for the output signal is increased beyond the critical path length t_a. The run times
for the case when the required time is 2t_a are shown in Table 4.2. Now, our algorithm
provided a speedup between 4 and 13 over the algorithm of [3]. Table 4.3 shows the
relative performance of the two algorithms as we increase the required time t_r for one
of the test circuits, squar5. The relative speedup provided by our algorithm increases
from a low of 0.78 (t_r = t_a; Table 4.1) to a high of 42 when t_r = 100 = 12.5 t_a.

Tables 4.4 to 4.9 show the results for the two GGR algorithms using convex pairs.
Our library includes five implementations (i.e., (delay, power consumption) pairs) for
each gate type (NAND, NOR and INVERTER). Since both the GGR algorithms
are heuristics, we compare the power reductions obtained by each rather than their
run times. The first group of tables (Tables 4.4 to 4.6) shows the results for the case
when t_r = t_a; the second group of tables (Tables 4.7 to 4.9) is for the case t_r = 2t_a.

Table 4.1. Run time and speedup when required time is equal to critical path length

circuit     n    t_a    [3]     Our    speedup
5xp1      158    11     0.39    0.10     3.99
b12       147    13     0.34    0.09     3.91
clip      278    15     1.28    0.25     5.07
rd73      219    17     1.74    0.21     8.37
sao2      225    18     1.13    0.21     5.47
set       170     8     0.10    0.09     1.10
squar5     99     8     0.03    0.04     0.78
t481       59    10     0.01    0.01     1.00
ttt2      340    11     4.94    0.47    10.62

Table 4.2. Run time and speedup when required time is doubled

circuit     n    t_a    Chen    Our    speedup
5xp1      158    11     1.04    0.14     7.22
b12       147    13     0.55    0.12     4.46
clip      278    15     4.82    0.43    11.16
rd73      219    17     1.55    0.29     5.29
sao2      225    18     1.61    0.30     5.40
set       170     8     1.29    0.14     9.47
squar5     99     8     0.22    0.05     4.29
t481       59    10     0.16    0.01    13.25
ttt2      340    11     7.17    0.67    10.75

Table 4.3. Run time for squar5 with different required time

t_r     Chen     Our      speedup
 10     0.076    0.044      1.72
 20     0.315    0.052      6.06
 30     0.555    0.054     10.28
 40     0.795    0.052     15.29
 50     1.033    0.055     18.78
100     2.235    0.053     42.17

Within the same group of tables, the circuits differ in the selection of initial (delay,
power consumption) pairs for the gates. As can be seen, when t_r = t_a, our algorithm
obtained a larger power reduction in 18 of 27 tests; when t_r = 2t_a, our algorithm
obtained a larger power reduction in all 27 cases.

Our  GGR  heuristic  took  between  0.01  seconds  and  5.60  seconds  for  the  test  cases. 
This  is  considerably  more  time  than  required  by  the  heuristic  of  Chen  and  Sarrafzadeh 
[3],  which  took  less  than  0.30  seconds  for  each  test.  However,  the  run  time  of  our 
heuristic  is  reasonable  and  our  heuristic  generally  produces  better  solutions  than 
those  produced  by  the  heuristic  of  Chen  and  Sarrafzadeh  [3]. 

Table 4.4. Power reduction of GGR algorithms (1)

circuit     Chen      Our       Diff     % imp
5xp1       47416     51386      3970      8.37
b12        56628     63191      6563     11.59
clip       70912     77252      6340      8.94
rd73       76374     88223     11849     15.51
sao2       88922     96865      7943      8.93
set        54756     53809      -947     -1.73
squar5     32603     34481      1878      5.76
t481        3428      3466        38      1.11
ttt2      139626    151590     11964      8.57

Table 4.5. Power reduction of GGR algorithms (2)

circuit     Chen      Our       Diff     % imp
5xp1      102586    112719     10133      9.88
b12       101174    108306      7132      7.05
clip      171917    194559     22642     13.17
rd73      147608    154986      7378      5.00
sao2      132655    137355      4700      3.54
set        98939    111848     12909     13.05
squar5     68189     73090      4901      7.19
t481       25345     27461      2116      8.35
ttt2      231072    247243     16171      7.00

Table 4.6. Power reduction of GGR algorithms (3)

circuit     Chen      Our       Diff     % imp
5xp1       41448     46050      4602     11.10
b12        42716     45812      3096      7.25
clip       62986     66819      3833      6.09
rd73       63160     71430      8270     13.09
sao2       74291     80777      6486      8.73
set        46212     44805     -1407     -3.04
squar5     26192     28172      1980      7.56
t481        3222      2652      -570    -17.69
ttt2      112569    119937      7368      6.55

Table 4.7. Power reduction of GGR algorithms (4)

circuit     Chen      Our       Diff     % imp
5xp1       90859    100867     10008     11.01
b12        80015     84321      4306      5.38
clip      157222    171358     14136      8.99
rd73      120554    126382      5828      4.83
sao2      109121    111934      2813      2.58
set        79458     86699      7241      9.11
squar5     58536     61768      3232      5.52
t481       18361     20373      2012     10.96
ttt2      177478    190093     12615      7.11

Table 4.8. Power reduction of GGR algorithms (5)

circuit     Chen      Our       Diff     % imp
5xp1       29651     26047     -3604    -12.15
b12        31532     31771       239      0.76
clip       40721     39796      -925     -2.27
rd73       45337     45015      -322     -0.71
sao2       53423     58414      4991      9.34
set        26175     22871     -3304    -12.62
squar5     13123     13186        63      0.48
t481        1843      1622      -221    -11.99
ttt2       72498     70742     -1756     -2.42

Table 4.9. Power reduction of GGR algorithms (6)

circuit     Chen      Our       Diff     % imp
5xp1       53356     67287     13931     26.11
b12        57504     65071      7567     13.16
clip      101648    117108     15460     15.21
rd73       91287     99225      7938      8.70
sao2       91862     97392      5530      6.02
set        56641     62326      5685     10.04
squar5     38067     41557      3490      9.17
t481        9782     14015      4233     43.27
ttt2      128515    147654     19139     14.89

4.8     Conclusion

We have developed polynomial time algorithms for the CGR, CUGR, and ConvexCGR
problems for series-parallel and tree circuits. The CGR problem with multigate
modules was shown to be NP-hard. We presented a unified framework for the solution
of CGR, CUGR and ConvexCGR problems on general circuits. This framework can
also be used to obtain a heuristic for the ConvexGGR problem. Experimental results
obtained by us indicate that our CGR algorithm is faster than the CGR algorithm of
Chen and Sarrafzadeh [3] and that our GGR heuristic often obtains better solutions
than those obtained by the GGR heuristic of Chen and Sarrafzadeh [3].


CHAPTER  5 
CONCLUSIONS  AND  FUTURE  WORK 


We have considered some problems that arise in the automation of various stages
of the VLSI physical design process. The first problem we considered is transistor
folding to reduce layout area. An algorithm was developed to minimize the layout
area. This algorithm outperforms the existing one both asymptotically and
experimentally.

We considered the problem of selecting the implementations of two rows of
modules on a routing channel so as to satisfy the net-span constraints as well as minimize
the channel density. An algorithm was developed by applying the limited branching
method. Experimental results indicate a significant reduction in run time over the
existing algorithm.

Another problem we considered is low power gate resizing. We increase the area
of gates to reduce the power consumption while satisfying the time constraint for the
circuit. Fast algorithms were developed for series-parallel and tree circuits, and a variant
of the problem with multigate modules was proved to be NP-hard. We also developed
a unified framework for the solution of CGR, CUGR and ConvexCGR problems on
general circuits. We used this framework to obtain a heuristic for the ConvexGGR
problem. Experimental results indicate a significant reduction in run time for our
CGR algorithm over the existing algorithm. Our ConvexGGR heuristic often obtains
better solutions than those obtained by the heuristic of Chen and Sarrafzadeh [3].

Future research on these problems could include the development of better
algorithms for the multichannel 2-PDMIS problem (especially for c > 2); the
development of better heuristics for the general single and multichannel PDMIS problems;
and faster improved heuristics for the GGR problem.